Labeling & Peer Grading: Your homework will be peer graded. To stay anonymous, avoid using your name and label your file with the last four digits of your student ID (e.g., HW2_Solutions_3938).
Submission: Submit both your IPython notebook (.ipynb) and an HTML file of the notebook to Canvas under Assignments → HW 2 → Submit Assignment. After submitting, download and check the files to make sure that you've uploaded the correct versions. Both files are required for your HW to be graded.
AI Use Policy: Solve each problem independently by yourself. Use AI tools like ChatGPT or Google Gemini for brainstorming and learning only—copying AI-generated content is prohibited. You do not neeViolations will lead to penalties, up to failing the course.
Problem Structure: Break down each problem ( already done in most problems) into three interconnected parts and implement each in separate code cells. Ensure that each part logically builds on the previous one. Include comments in your code to explain its purpose, followed by a Markdown cell analyzing what was achieved. After completing all parts, add a final Markdown cell reflecting on your overall approach, discussing any challenges faced, and explaining how you utilized AI tools in your process.
Upload the newimdb data and read the details on the columns about the data following the link https://www.imdb.com/interfaces/.
Analyze the dataset for missing values, particularly in columns like 'runtimeMinutes' and 'genres'. Determine whether to remove rows with missing data or impute values using appropriate statistical methods (mean, median, or mode), and justify your approach. For feature engineering, transform non-numerical columns such as 'genres' and 'directors' using one-hot encoding or label encoding techniques. Consider grouping genres into broader categories to potentially improve model performance. Provide clear explanations for your strategies in handling missing values and implementing feature engineering, ensuring your decisions are data-driven and aligned with the goals of the analysis.
Conduct exploratory data analysis (EDA) on all remaining columns. Handle non-numerical values appropriately, convert data types as needed, and apply transformations or standardization where necessary. Implement dummy coding for categorical variables. For classification purposes, create a new 'Rating' column derived from 'averageRating'. Instead of using a predetermined scale, devise a binning strategy that results in a balanced class distribution across 5 rating categories (1 to 5 stars). Explain your chosen binning method and analyze the resulting distribution. Discuss how this balanced approach may impact your classification model's performance compared to using the original 'averageRating' values.
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings('ignore')
pd.set_option('display.max_columns', None)
data_loc = "/Users/gabrielvictorgomesferreira/Library/Mobile Documents/com~apple~CloudDocs/Work/ISU Classes/MIS 546 - Advanced Business Analytics/Data/"
file_name = "newimdb.csv"
imdb_df = pd.read_csv(data_loc + file_name)
print("Data dimensions: ", imdb_df.shape)
imdb_df.head(20)
Data dimensions: (2795295, 14)
| Unnamed: 0 | tconst | types | isOriginalTitle | titleType | originalTitle | isAdult | startYear | runtimeMinutes | genres | directors | category | averageRating | numVotes | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 6475881 | tt0111596 | imdbDisplay | 0.0 | movie | Vazir | 0.0 | 2015 | \N | Action,Crime,Drama | nm13648797 | editor | 8.1 | 63.0 |
| 1 | 6475882 | tt0111596 | imdbDisplay | 0.0 | movie | Vazir | 0.0 | 2015 | \N | Action,Crime,Drama | nm13648797 | actor | 8.1 | 63.0 |
| 2 | 6475884 | tt0111596 | imdbDisplay | 0.0 | movie | Vazir | 0.0 | 2015 | \N | Action,Crime,Drama | nm13648797 | actress | 8.1 | 63.0 |
| 3 | 6475886 | tt0111596 | imdbDisplay | 0.0 | movie | Vazir | 0.0 | 2015 | \N | Action,Crime,Drama | nm13648797 | director | 8.1 | 63.0 |
| 4 | 6475887 | tt0111596 | imdbDisplay | 0.0 | movie | Vazir | 0.0 | 2015 | \N | Action,Crime,Drama | nm13648797 | writer | 8.1 | 63.0 |
| 5 | 6475889 | tt0111596 | imdbDisplay | 0.0 | movie | Vazir | 0.0 | 2015 | \N | Action,Crime,Drama | nm13648797 | composer | 8.1 | 63.0 |
| 6 | 6475890 | tt0111596 | imdbDisplay | 0.0 | movie | Vazir | 0.0 | 2015 | \N | Action,Crime,Drama | nm13648797 | cinematographer | 8.1 | 63.0 |
| 7 | 6475891 | tt0111596 | original | 1.0 | movie | Vazir | 0.0 | 2015 | \N | Action,Crime,Drama | nm13648797 | editor | 8.1 | 63.0 |
| 8 | 6475892 | tt0111596 | original | 1.0 | movie | Vazir | 0.0 | 2015 | \N | Action,Crime,Drama | nm13648797 | actor | 8.1 | 63.0 |
| 9 | 6475894 | tt0111596 | original | 1.0 | movie | Vazir | 0.0 | 2015 | \N | Action,Crime,Drama | nm13648797 | actress | 8.1 | 63.0 |
| 10 | 6475896 | tt0111596 | original | 1.0 | movie | Vazir | 0.0 | 2015 | \N | Action,Crime,Drama | nm13648797 | director | 8.1 | 63.0 |
| 11 | 6475897 | tt0111596 | original | 1.0 | movie | Vazir | 0.0 | 2015 | \N | Action,Crime,Drama | nm13648797 | writer | 8.1 | 63.0 |
| 12 | 6475899 | tt0111596 | original | 1.0 | movie | Vazir | 0.0 | 2015 | \N | Action,Crime,Drama | nm13648797 | composer | 8.1 | 63.0 |
| 13 | 6475900 | tt0111596 | original | 1.0 | movie | Vazir | 0.0 | 2015 | \N | Action,Crime,Drama | nm13648797 | cinematographer | 8.1 | 63.0 |
| 14 | 6955478 | tt0118589 | imdbDisplay | 0.0 | movie | Glitter | 0.0 | 2001 | 104 | Drama,Music,Romance | nm0193554 | cinematographer | 2.4 | 23686.0 |
| 15 | 6955479 | tt0118589 | imdbDisplay | 0.0 | movie | Glitter | 0.0 | 2001 | 104 | Drama,Music,Romance | nm0193554 | actress | 2.4 | 23686.0 |
| 16 | 6955480 | tt0118589 | imdbDisplay | 0.0 | movie | Glitter | 0.0 | 2001 | 104 | Drama,Music,Romance | nm0193554 | actor | 2.4 | 23686.0 |
| 17 | 6955483 | tt0118589 | imdbDisplay | 0.0 | movie | Glitter | 0.0 | 2001 | 104 | Drama,Music,Romance | nm0193554 | director | 2.4 | 23686.0 |
| 18 | 6955484 | tt0118589 | imdbDisplay | 0.0 | movie | Glitter | 0.0 | 2001 | 104 | Drama,Music,Romance | nm0193554 | writer | 2.4 | 23686.0 |
| 19 | 6955486 | tt0118589 | imdbDisplay | 0.0 | movie | Glitter | 0.0 | 2001 | 104 | Drama,Music,Romance | nm0193554 | producer | 2.4 | 23686.0 |
imdb_df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 2795295 entries, 0 to 2795294 Data columns (total 14 columns): # Column Dtype --- ------ ----- 0 Unnamed: 0 int64 1 tconst object 2 types object 3 isOriginalTitle float64 4 titleType object 5 originalTitle object 6 isAdult float64 7 startYear int64 8 runtimeMinutes object 9 genres object 10 directors object 11 category object 12 averageRating float64 13 numVotes float64 dtypes: float64(4), int64(2), object(8) memory usage: 298.6+ MB
imdb_df[imdb_df['originalTitle'] == 'Vazir']
| Unnamed: 0 | tconst | types | isOriginalTitle | titleType | originalTitle | isAdult | startYear | runtimeMinutes | genres | directors | category | averageRating | numVotes | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 6475881 | tt0111596 | imdbDisplay | 0.0 | movie | Vazir | 0.0 | 2015 | \N | Action,Crime,Drama | nm13648797 | editor | 8.1 | 63.0 |
| 1 | 6475882 | tt0111596 | imdbDisplay | 0.0 | movie | Vazir | 0.0 | 2015 | \N | Action,Crime,Drama | nm13648797 | actor | 8.1 | 63.0 |
| 2 | 6475884 | tt0111596 | imdbDisplay | 0.0 | movie | Vazir | 0.0 | 2015 | \N | Action,Crime,Drama | nm13648797 | actress | 8.1 | 63.0 |
| 3 | 6475886 | tt0111596 | imdbDisplay | 0.0 | movie | Vazir | 0.0 | 2015 | \N | Action,Crime,Drama | nm13648797 | director | 8.1 | 63.0 |
| 4 | 6475887 | tt0111596 | imdbDisplay | 0.0 | movie | Vazir | 0.0 | 2015 | \N | Action,Crime,Drama | nm13648797 | writer | 8.1 | 63.0 |
| 5 | 6475889 | tt0111596 | imdbDisplay | 0.0 | movie | Vazir | 0.0 | 2015 | \N | Action,Crime,Drama | nm13648797 | composer | 8.1 | 63.0 |
| 6 | 6475890 | tt0111596 | imdbDisplay | 0.0 | movie | Vazir | 0.0 | 2015 | \N | Action,Crime,Drama | nm13648797 | cinematographer | 8.1 | 63.0 |
| 7 | 6475891 | tt0111596 | original | 1.0 | movie | Vazir | 0.0 | 2015 | \N | Action,Crime,Drama | nm13648797 | editor | 8.1 | 63.0 |
| 8 | 6475892 | tt0111596 | original | 1.0 | movie | Vazir | 0.0 | 2015 | \N | Action,Crime,Drama | nm13648797 | actor | 8.1 | 63.0 |
| 9 | 6475894 | tt0111596 | original | 1.0 | movie | Vazir | 0.0 | 2015 | \N | Action,Crime,Drama | nm13648797 | actress | 8.1 | 63.0 |
| 10 | 6475896 | tt0111596 | original | 1.0 | movie | Vazir | 0.0 | 2015 | \N | Action,Crime,Drama | nm13648797 | director | 8.1 | 63.0 |
| 11 | 6475897 | tt0111596 | original | 1.0 | movie | Vazir | 0.0 | 2015 | \N | Action,Crime,Drama | nm13648797 | writer | 8.1 | 63.0 |
| 12 | 6475899 | tt0111596 | original | 1.0 | movie | Vazir | 0.0 | 2015 | \N | Action,Crime,Drama | nm13648797 | composer | 8.1 | 63.0 |
| 13 | 6475900 | tt0111596 | original | 1.0 | movie | Vazir | 0.0 | 2015 | \N | Action,Crime,Drama | nm13648797 | cinematographer | 8.1 | 63.0 |
In the above sample, all observations point to the same movie, but there are different title types in types column that only impact the binary isOriginalTitle column, which indicates whether the title is original. This variation doesn't seem to affect any other features. I will further explore this hypothesis using additional examples.
imdb_df['types'].value_counts()
types original 1080513 imdbDisplay 1055184 \N 395137 alternative 85040 working 73868 festival 42436 dvd 36697 tv 21941 video 3283 dvdimdbDisplay 823 festivalimdbDisplay 142 imdbDisplaytv 103 imdbDisplayworking 46 imdbDisplayvideo 26 alternativedvd 20 alternativetv 13 alternativeworking 13 videoworking 8 alternativefestival 2 Name: count, dtype: int64
imdb_df['titleType'].value_counts()
titleType movie 1549698 short 659811 tvMovie 195005 tvSeries 187470 video 112483 videoGame 33181 tvMiniSeries 27829 tvSpecial 23492 tvShort 5360 tvEpisode 966 Name: count, dtype: int64
imdb_df[imdb_df['types'] == 'alternative']
| Unnamed: 0 | tconst | types | isOriginalTitle | titleType | originalTitle | isAdult | startYear | runtimeMinutes | genres | directors | category | averageRating | numVotes | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 60 | 6961780 | tt0118652 | alternative | 0.0 | movie | The Attic Expeditions | 0.0 | 2001 | 100 | Comedy,Horror,Mystery | nm0440948 | composer | 4.9 | 1850.0 |
| 61 | 6961781 | tt0118652 | alternative | 0.0 | movie | The Attic Expeditions | 0.0 | 2001 | 100 | Comedy,Horror,Mystery | nm0440948 | actor | 4.9 | 1850.0 |
| 62 | 6961784 | tt0118652 | alternative | 0.0 | movie | The Attic Expeditions | 0.0 | 2001 | 100 | Comedy,Horror,Mystery | nm0440948 | actress | 4.9 | 1850.0 |
| 63 | 6961785 | tt0118652 | alternative | 0.0 | movie | The Attic Expeditions | 0.0 | 2001 | 100 | Comedy,Horror,Mystery | nm0440948 | director | 4.9 | 1850.0 |
| 64 | 6961786 | tt0118652 | alternative | 0.0 | movie | The Attic Expeditions | 0.0 | 2001 | 100 | Comedy,Horror,Mystery | nm0440948 | writer | 4.9 | 1850.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2795262 | 32998206 | tt9916362 | alternative | 0.0 | movie | Akelarre | 0.0 | 2020 | 92 | Drama,History | nm1893148 | actress | 6.4 | 5179.0 |
| 2795263 | 32998207 | tt9916362 | alternative | 0.0 | movie | Akelarre | 0.0 | 2020 | 92 | Drama,History | nm1893148 | actor | 6.4 | 5179.0 |
| 2795264 | 32998210 | tt9916362 | alternative | 0.0 | movie | Akelarre | 0.0 | 2020 | 92 | Drama,History | nm1893148 | director | 6.4 | 5179.0 |
| 2795265 | 32998211 | tt9916362 | alternative | 0.0 | movie | Akelarre | 0.0 | 2020 | 92 | Drama,History | nm1893148 | writer | 6.4 | 5179.0 |
| 2795266 | 32998212 | tt9916362 | alternative | 0.0 | movie | Akelarre | 0.0 | 2020 | 92 | Drama,History | nm1893148 | producer | 6.4 | 5179.0 |
85040 rows × 14 columns
imdb_df.iloc[0,8]
'\\N'
imdb_df[imdb_df['types'] == '\\N']
| Unnamed: 0 | tconst | types | isOriginalTitle | titleType | originalTitle | isAdult | startYear | runtimeMinutes | genres | directors | category | averageRating | numVotes | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 21 | 6955488 | tt0118589 | \N | 0.0 | movie | Glitter | 0.0 | 2001 | 104 | Drama,Music,Romance | nm0193554 | cinematographer | 2.4 | 23686.0 |
| 22 | 6955489 | tt0118589 | \N | 0.0 | movie | Glitter | 0.0 | 2001 | 104 | Drama,Music,Romance | nm0193554 | actress | 2.4 | 23686.0 |
| 23 | 6955490 | tt0118589 | \N | 0.0 | movie | Glitter | 0.0 | 2001 | 104 | Drama,Music,Romance | nm0193554 | actor | 2.4 | 23686.0 |
| 24 | 6955493 | tt0118589 | \N | 0.0 | movie | Glitter | 0.0 | 2001 | 104 | Drama,Music,Romance | nm0193554 | director | 2.4 | 23686.0 |
| 25 | 6955494 | tt0118589 | \N | 0.0 | movie | Glitter | 0.0 | 2001 | 104 | Drama,Music,Romance | nm0193554 | writer | 2.4 | 23686.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2795280 | 32998342 | tt9916538 | \N | 0.0 | movie | Kuambil Lagi Hatiku | 0.0 | 2019 | 123 | Drama | nm4457074 | actress | 8.3 | 6.0 |
| 2795281 | 32998344 | tt9916538 | \N | 0.0 | movie | Kuambil Lagi Hatiku | 0.0 | 2019 | 123 | Drama | nm4457074 | actor | 8.3 | 6.0 |
| 2795282 | 32998346 | tt9916538 | \N | 0.0 | movie | Kuambil Lagi Hatiku | 0.0 | 2019 | 123 | Drama | nm4457074 | director | 8.3 | 6.0 |
| 2795283 | 32998347 | tt9916538 | \N | 0.0 | movie | Kuambil Lagi Hatiku | 0.0 | 2019 | 123 | Drama | nm4457074 | writer | 8.3 | 6.0 |
| 2795284 | 32998350 | tt9916538 | \N | 0.0 | movie | Kuambil Lagi Hatiku | 0.0 | 2019 | 123 | Drama | nm4457074 | producer | 8.3 | 6.0 |
395137 rows × 14 columns
imdb_df[imdb_df['originalTitle'] == 'The Attic Expeditions']
| Unnamed: 0 | tconst | types | isOriginalTitle | titleType | originalTitle | isAdult | startYear | runtimeMinutes | genres | directors | category | averageRating | numVotes | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 42 | 6961650 | tt0118652 | imdbDisplay | 0.0 | movie | The Attic Expeditions | 0.0 | 2001 | 100 | Comedy,Horror,Mystery | nm0440948 | composer | 4.9 | 1850.0 |
| 43 | 6961651 | tt0118652 | imdbDisplay | 0.0 | movie | The Attic Expeditions | 0.0 | 2001 | 100 | Comedy,Horror,Mystery | nm0440948 | actor | 4.9 | 1850.0 |
| 44 | 6961654 | tt0118652 | imdbDisplay | 0.0 | movie | The Attic Expeditions | 0.0 | 2001 | 100 | Comedy,Horror,Mystery | nm0440948 | actress | 4.9 | 1850.0 |
| 45 | 6961655 | tt0118652 | imdbDisplay | 0.0 | movie | The Attic Expeditions | 0.0 | 2001 | 100 | Comedy,Horror,Mystery | nm0440948 | director | 4.9 | 1850.0 |
| 46 | 6961656 | tt0118652 | imdbDisplay | 0.0 | movie | The Attic Expeditions | 0.0 | 2001 | 100 | Comedy,Horror,Mystery | nm0440948 | writer | 4.9 | 1850.0 |
| 47 | 6961657 | tt0118652 | imdbDisplay | 0.0 | movie | The Attic Expeditions | 0.0 | 2001 | 100 | Comedy,Horror,Mystery | nm0440948 | producer | 4.9 | 1850.0 |
| 48 | 6961680 | tt0118652 | original | 1.0 | movie | The Attic Expeditions | 0.0 | 2001 | 100 | Comedy,Horror,Mystery | nm0440948 | composer | 4.9 | 1850.0 |
| 49 | 6961681 | tt0118652 | original | 1.0 | movie | The Attic Expeditions | 0.0 | 2001 | 100 | Comedy,Horror,Mystery | nm0440948 | actor | 4.9 | 1850.0 |
| 50 | 6961684 | tt0118652 | original | 1.0 | movie | The Attic Expeditions | 0.0 | 2001 | 100 | Comedy,Horror,Mystery | nm0440948 | actress | 4.9 | 1850.0 |
| 51 | 6961685 | tt0118652 | original | 1.0 | movie | The Attic Expeditions | 0.0 | 2001 | 100 | Comedy,Horror,Mystery | nm0440948 | director | 4.9 | 1850.0 |
| 52 | 6961686 | tt0118652 | original | 1.0 | movie | The Attic Expeditions | 0.0 | 2001 | 100 | Comedy,Horror,Mystery | nm0440948 | writer | 4.9 | 1850.0 |
| 53 | 6961687 | tt0118652 | original | 1.0 | movie | The Attic Expeditions | 0.0 | 2001 | 100 | Comedy,Horror,Mystery | nm0440948 | producer | 4.9 | 1850.0 |
| 54 | 6961740 | tt0118652 | dvd | 0.0 | movie | The Attic Expeditions | 0.0 | 2001 | 100 | Comedy,Horror,Mystery | nm0440948 | composer | 4.9 | 1850.0 |
| 55 | 6961741 | tt0118652 | dvd | 0.0 | movie | The Attic Expeditions | 0.0 | 2001 | 100 | Comedy,Horror,Mystery | nm0440948 | actor | 4.9 | 1850.0 |
| 56 | 6961744 | tt0118652 | dvd | 0.0 | movie | The Attic Expeditions | 0.0 | 2001 | 100 | Comedy,Horror,Mystery | nm0440948 | actress | 4.9 | 1850.0 |
| 57 | 6961745 | tt0118652 | dvd | 0.0 | movie | The Attic Expeditions | 0.0 | 2001 | 100 | Comedy,Horror,Mystery | nm0440948 | director | 4.9 | 1850.0 |
| 58 | 6961746 | tt0118652 | dvd | 0.0 | movie | The Attic Expeditions | 0.0 | 2001 | 100 | Comedy,Horror,Mystery | nm0440948 | writer | 4.9 | 1850.0 |
| 59 | 6961747 | tt0118652 | dvd | 0.0 | movie | The Attic Expeditions | 0.0 | 2001 | 100 | Comedy,Horror,Mystery | nm0440948 | producer | 4.9 | 1850.0 |
| 60 | 6961780 | tt0118652 | alternative | 0.0 | movie | The Attic Expeditions | 0.0 | 2001 | 100 | Comedy,Horror,Mystery | nm0440948 | composer | 4.9 | 1850.0 |
| 61 | 6961781 | tt0118652 | alternative | 0.0 | movie | The Attic Expeditions | 0.0 | 2001 | 100 | Comedy,Horror,Mystery | nm0440948 | actor | 4.9 | 1850.0 |
| 62 | 6961784 | tt0118652 | alternative | 0.0 | movie | The Attic Expeditions | 0.0 | 2001 | 100 | Comedy,Horror,Mystery | nm0440948 | actress | 4.9 | 1850.0 |
| 63 | 6961785 | tt0118652 | alternative | 0.0 | movie | The Attic Expeditions | 0.0 | 2001 | 100 | Comedy,Horror,Mystery | nm0440948 | director | 4.9 | 1850.0 |
| 64 | 6961786 | tt0118652 | alternative | 0.0 | movie | The Attic Expeditions | 0.0 | 2001 | 100 | Comedy,Horror,Mystery | nm0440948 | writer | 4.9 | 1850.0 |
| 65 | 6961787 | tt0118652 | alternative | 0.0 | movie | The Attic Expeditions | 0.0 | 2001 | 100 | Comedy,Horror,Mystery | nm0440948 | producer | 4.9 | 1850.0 |
imdb_df[imdb_df['originalTitle'] == 'Akelarre']
| Unnamed: 0 | tconst | types | isOriginalTitle | titleType | originalTitle | isAdult | startYear | runtimeMinutes | genres | directors | category | averageRating | numVotes | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2795249 | 32998075 | tt9916362 | imdbDisplay | 0.0 | movie | Akelarre | 0.0 | 2020 | 92 | Drama,History | nm1893148 | composer | 6.4 | 5179.0 |
| 2795250 | 32998076 | tt9916362 | imdbDisplay | 0.0 | movie | Akelarre | 0.0 | 2020 | 92 | Drama,History | nm1893148 | actress | 6.4 | 5179.0 |
| 2795251 | 32998077 | tt9916362 | imdbDisplay | 0.0 | movie | Akelarre | 0.0 | 2020 | 92 | Drama,History | nm1893148 | actor | 6.4 | 5179.0 |
| 2795252 | 32998080 | tt9916362 | imdbDisplay | 0.0 | movie | Akelarre | 0.0 | 2020 | 92 | Drama,History | nm1893148 | director | 6.4 | 5179.0 |
| 2795253 | 32998081 | tt9916362 | imdbDisplay | 0.0 | movie | Akelarre | 0.0 | 2020 | 92 | Drama,History | nm1893148 | writer | 6.4 | 5179.0 |
| 2795254 | 32998082 | tt9916362 | imdbDisplay | 0.0 | movie | Akelarre | 0.0 | 2020 | 92 | Drama,History | nm1893148 | producer | 6.4 | 5179.0 |
| 2795255 | 32998125 | tt9916362 | original | 1.0 | movie | Akelarre | 0.0 | 2020 | 92 | Drama,History | nm1893148 | composer | 6.4 | 5179.0 |
| 2795256 | 32998126 | tt9916362 | original | 1.0 | movie | Akelarre | 0.0 | 2020 | 92 | Drama,History | nm1893148 | actress | 6.4 | 5179.0 |
| 2795257 | 32998127 | tt9916362 | original | 1.0 | movie | Akelarre | 0.0 | 2020 | 92 | Drama,History | nm1893148 | actor | 6.4 | 5179.0 |
| 2795258 | 32998130 | tt9916362 | original | 1.0 | movie | Akelarre | 0.0 | 2020 | 92 | Drama,History | nm1893148 | director | 6.4 | 5179.0 |
| 2795259 | 32998131 | tt9916362 | original | 1.0 | movie | Akelarre | 0.0 | 2020 | 92 | Drama,History | nm1893148 | writer | 6.4 | 5179.0 |
| 2795260 | 32998132 | tt9916362 | original | 1.0 | movie | Akelarre | 0.0 | 2020 | 92 | Drama,History | nm1893148 | producer | 6.4 | 5179.0 |
| 2795261 | 32998205 | tt9916362 | alternative | 0.0 | movie | Akelarre | 0.0 | 2020 | 92 | Drama,History | nm1893148 | composer | 6.4 | 5179.0 |
| 2795262 | 32998206 | tt9916362 | alternative | 0.0 | movie | Akelarre | 0.0 | 2020 | 92 | Drama,History | nm1893148 | actress | 6.4 | 5179.0 |
| 2795263 | 32998207 | tt9916362 | alternative | 0.0 | movie | Akelarre | 0.0 | 2020 | 92 | Drama,History | nm1893148 | actor | 6.4 | 5179.0 |
| 2795264 | 32998210 | tt9916362 | alternative | 0.0 | movie | Akelarre | 0.0 | 2020 | 92 | Drama,History | nm1893148 | director | 6.4 | 5179.0 |
| 2795265 | 32998211 | tt9916362 | alternative | 0.0 | movie | Akelarre | 0.0 | 2020 | 92 | Drama,History | nm1893148 | writer | 6.4 | 5179.0 |
| 2795266 | 32998212 | tt9916362 | alternative | 0.0 | movie | Akelarre | 0.0 | 2020 | 92 | Drama,History | nm1893148 | producer | 6.4 | 5179.0 |
Based on the above samples/exploration, I identified a consistent pattern where each media entry has an original type along with variants that do not affect the averageRating or numVotes. It repeats the the category over for each unique value in types, which is redundant. I will keep only records for one type in types column, as the variants provide no additional value to the analysis. After removing all records and will drop the types and isOriginalTitle columns as we not longer need them.
# Checking the number of unique titles before the cleaning to make sure we don't lose data
nunique_titles = imdb_df['originalTitle'].nunique()
print("Number of unique titles: ", nunique_titles)
Number of unique titles: 229522
# Removing all observations that are not the original
imdb_original = imdb_df[(imdb_df['types'] == "original") | (imdb_df['types'] == "tv")]
# Checking the number of unique titles after the cleaning to make sure we don't lose data
nunique_imdb_original = imdb_original['originalTitle'].nunique()
# Dataset after trasformation
print("Data dimensions: ", imdb_original.shape)
print("Number of unique titles: ", nunique_imdb_original)
imdb_original.head()
Data dimensions: (1102454, 14) Number of unique titles: 207981
| Unnamed: 0 | tconst | types | isOriginalTitle | titleType | originalTitle | isAdult | startYear | runtimeMinutes | genres | directors | category | averageRating | numVotes | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 7 | 6475891 | tt0111596 | original | 1.0 | movie | Vazir | 0.0 | 2015 | \N | Action,Crime,Drama | nm13648797 | editor | 8.1 | 63.0 |
| 8 | 6475892 | tt0111596 | original | 1.0 | movie | Vazir | 0.0 | 2015 | \N | Action,Crime,Drama | nm13648797 | actor | 8.1 | 63.0 |
| 9 | 6475894 | tt0111596 | original | 1.0 | movie | Vazir | 0.0 | 2015 | \N | Action,Crime,Drama | nm13648797 | actress | 8.1 | 63.0 |
| 10 | 6475896 | tt0111596 | original | 1.0 | movie | Vazir | 0.0 | 2015 | \N | Action,Crime,Drama | nm13648797 | director | 8.1 | 63.0 |
| 11 | 6475897 | tt0111596 | original | 1.0 | movie | Vazir | 0.0 | 2015 | \N | Action,Crime,Drama | nm13648797 | writer | 8.1 | 63.0 |
We did lost some midias that did not have original titles. Let's look at them below.
print("Number of records reduce was: ", imdb_df.shape[0] - imdb_original.shape[0])
print("Number of unique titles lost was: ", nunique_titles - nunique_imdb_original)
Number of records reduce was: 1692841 Number of unique titles lost was: 21541
Let's go after the lost records.
# List o unique titles after the cleaning
orig_titles = imdb_original['originalTitle'].unique()
# Movies that don't have original title and therefore are not in the the dataset after the cleaning
movies_no_orig = imdb_df[~(imdb_df['originalTitle'].isin(orig_titles))]
# Checking the number of unique titles that are not in the cleaned df
nunique_movies_no_orig = movies_no_orig['originalTitle'].nunique()
# Dataset after trasformation
print("Data dimensions: ", movies_no_orig.shape)
print("Number of unique titles: ", nunique_movies_no_orig)
movies_no_orig.head()
Data dimensions: (95880, 14) Number of unique titles: 21541
| Unnamed: 0 | tconst | types | isOriginalTitle | titleType | originalTitle | isAdult | startYear | runtimeMinutes | genres | directors | category | averageRating | numVotes | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 409 | 7702551 | tt0144449 | \N | 0.0 | movie | Nema aviona za Zagreb | 0.0 | 2012 | 82 | Biography | nm0309428 | cinematographer | 8.2 | 8.0 |
| 410 | 7702552 | tt0144449 | \N | 0.0 | movie | Nema aviona za Zagreb | 0.0 | 2012 | 82 | Biography | nm0309428 | archive_footage | 8.2 | 8.0 |
| 411 | 7702555 | tt0144449 | \N | 0.0 | movie | Nema aviona za Zagreb | 0.0 | 2012 | 82 | Biography | nm0309428 | writer | 8.2 | 8.0 |
| 412 | 7702556 | tt0144449 | \N | 0.0 | movie | Nema aviona za Zagreb | 0.0 | 2012 | 82 | Biography | nm0309428 | producer | 8.2 | 8.0 |
| 1011 | 8274579 | tt0176001 | \N | 0.0 | movie | Panic Bodies | 0.0 | 2003 | 70 | Documentary | nm0393682 | actor | 7.0 | 12.0 |
The unique values match the 21,541 movies that were indentified as missing after selecting the original titles. Let's recover them.
## Get missing movies
titles_N= movies_no_orig[movies_no_orig['types'] == '\\N']
# Checking the number of unique titles that are not in the cleaned df
nunique_titles_N= titles_N['originalTitle'].nunique()
# Dataset after trasformation
print("Data dimensions: ", titles_N.shape)
print("Number of unique titles: ", nunique_titles_N)
titles_N.head()
Data dimensions: (95332, 14) Number of unique titles: 21507
| Unnamed: 0 | tconst | types | isOriginalTitle | titleType | originalTitle | isAdult | startYear | runtimeMinutes | genres | directors | category | averageRating | numVotes | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 409 | 7702551 | tt0144449 | \N | 0.0 | movie | Nema aviona za Zagreb | 0.0 | 2012 | 82 | Biography | nm0309428 | cinematographer | 8.2 | 8.0 |
| 410 | 7702552 | tt0144449 | \N | 0.0 | movie | Nema aviona za Zagreb | 0.0 | 2012 | 82 | Biography | nm0309428 | archive_footage | 8.2 | 8.0 |
| 411 | 7702555 | tt0144449 | \N | 0.0 | movie | Nema aviona za Zagreb | 0.0 | 2012 | 82 | Biography | nm0309428 | writer | 8.2 | 8.0 |
| 412 | 7702556 | tt0144449 | \N | 0.0 | movie | Nema aviona za Zagreb | 0.0 | 2012 | 82 | Biography | nm0309428 | producer | 8.2 | 8.0 |
| 1011 | 8274579 | tt0176001 | \N | 0.0 | movie | Panic Bodies | 0.0 | 2003 | 70 | Documentary | nm0393682 | actor | 7.0 | 12.0 |
print("Still missing {} movies" .format(nunique_movies_no_orig - nunique_titles_N))
Still missing 34 movies
movies_no_orig['types'].unique()
array(['\\N', 'imdbDisplay', 'working', 'dvd', 'festival', 'alternative',
'video'], dtype=object)
# List o unique titles after removing original and nulls
last_list = titles_N['originalTitle'].unique()
# Movies that are missing
movies_no_orig_null = movies_no_orig[~(movies_no_orig['originalTitle'].isin(last_list))]
# Checking the number of unique titles that are not in the cleaned df
nunique_missing = movies_no_orig_null['originalTitle'].nunique()
# Display
print("Number of unique titles: ", nunique_missing)
Number of unique titles: 34
movies_no_orig_null['types'].unique()
array(['imdbDisplay', 'working', 'alternative', 'festival'], dtype=object)
# Get last 34 movies
titles_imdb = movies_no_orig_null[movies_no_orig_null['types'] == 'imdbDisplay']
# Confirm the unique count
nunique_titles_imdb = titles_imdb['originalTitle'].nunique()
# Dataset after trasformation
print("Data dimensions: ", titles_imdb.shape)
print("Number of unique titles: ", nunique_titles_imdb)
titles_imdb.head()
Data dimensions: (161, 14) Number of unique titles: 34
| Unnamed: 0 | tconst | types | isOriginalTitle | titleType | originalTitle | isAdult | startYear | runtimeMinutes | genres | directors | category | averageRating | numVotes | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 25495 | 10485760 | tt0315056 | imdbDisplay | 0.0 | tvSeries | Saia Justa | 0.0 | 2002 | 60 | Talk-Show | nm11597242,nm1381005,nm1166794 | self | 6.7 | 29.0 |
| 59030 | 11239400 | tt0371599 | imdbDisplay | 0.0 | movie | Ein Goldfisch unter Haien | 0.0 | 2004 | 90 | Comedy,Romance | nm0090596 | composer | 6.7 | 32.0 |
| 59031 | 11239401 | tt0371599 | imdbDisplay | 0.0 | movie | Ein Goldfisch unter Haien | 0.0 | 2004 | 90 | Comedy,Romance | nm0090596 | actress | 6.7 | 32.0 |
| 59032 | 11239403 | tt0371599 | imdbDisplay | 0.0 | movie | Ein Goldfisch unter Haien | 0.0 | 2004 | 90 | Comedy,Romance | nm0090596 | actor | 6.7 | 32.0 |
| 59033 | 11239405 | tt0371599 | imdbDisplay | 0.0 | movie | Ein Goldfisch unter Haien | 0.0 | 2004 | 90 | Comedy,Romance | nm0090596 | director | 6.7 | 32.0 |
It got a little confusing, but at this point, we can significantly reduce the volume and redundancy of the data without loosing any significant information. Let's now consolidate the dasets and get all 229,522 initial count of unique titles.
# Concatenanting all datasets
imdb_clean = pd.concat([imdb_original, titles_N, titles_imdb], ignore_index=True)
# Removing unecessary columns
imdb_clean = imdb_clean.drop(columns=['Unnamed: 0', 'types', 'isOriginalTitle'])
# Confirm the unique count
nunique_imdb_clean = imdb_clean['originalTitle'].nunique()
# Dataset after trasformation
print("Data dimensions: ", imdb_clean.shape)
print("Number of unique titles: ", nunique_imdb_clean)
imdb_clean.head()
Data dimensions: (1197947, 11) Number of unique titles: 229522
| tconst | titleType | originalTitle | isAdult | startYear | runtimeMinutes | genres | directors | category | averageRating | numVotes | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | tt0111596 | movie | Vazir | 0.0 | 2015 | \N | Action,Crime,Drama | nm13648797 | editor | 8.1 | 63.0 |
| 1 | tt0111596 | movie | Vazir | 0.0 | 2015 | \N | Action,Crime,Drama | nm13648797 | actor | 8.1 | 63.0 |
| 2 | tt0111596 | movie | Vazir | 0.0 | 2015 | \N | Action,Crime,Drama | nm13648797 | actress | 8.1 | 63.0 |
| 3 | tt0111596 | movie | Vazir | 0.0 | 2015 | \N | Action,Crime,Drama | nm13648797 | director | 8.1 | 63.0 |
| 4 | tt0111596 | movie | Vazir | 0.0 | 2015 | \N | Action,Crime,Drama | nm13648797 | writer | 8.1 | 63.0 |
print("Number of records reduced was: ", imdb_df.shape[0] - imdb_clean.shape[0])
print("Number of unique titles lost was: ", nunique_titles - nunique_imdb_clean)
Number of records reduced was: 1597348 Number of unique titles lost was: 0
From the above line, we can see the number of records was significantly reduced without loosing any unique title. \
Moving forward, I still see an opportunity to reduce redundancy further in the dataset. I believe we can have one record per midia if we one-hot encode the category column.
print("We will have {} more columns in the dataset." .format(imdb_clean['category'].nunique() - 2))
We will have 10 more columns in the dataset.
# One-hot encode the 'category' column
imdb_one_hot = pd.get_dummies(imdb_clean, columns=['category'], drop_first=True, dtype=int)
# Group by movie identifiers and aggregate numeric columns
imdb_grouped = imdb_one_hot.groupby(
['tconst', 'titleType', 'originalTitle', 'isAdult', 'startYear', 'runtimeMinutes', 'genres', 'directors'],
as_index=False
).agg({
'averageRating': 'mean',
'numVotes': 'mean',
**{col: 'sum' for col in imdb_one_hot.columns if col.startswith(('category_'))}
})
# Remove the "category_" prefix from the one-hot encoded column names
imdb_grouped.columns = imdb_grouped.columns.str.replace('category_', '')
# Dataset after transformation
print("Data dimensions: ", imdb_grouped.shape)
print("Number of unique titles: ", imdb_grouped['originalTitle'].nunique())
imdb_grouped.head()
Data dimensions: (252763, 21) Number of unique titles: 229522
| tconst | titleType | originalTitle | isAdult | startYear | runtimeMinutes | genres | directors | averageRating | numVotes | actress | archive_footage | archive_sound | cinematographer | composer | director | editor | producer | production_designer | self | writer | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | tt0111596 | movie | Vazir | 0.0 | 2015 | \N | Action,Crime,Drama | nm13648797 | 8.1 | 63.0 | 1 | 0 | 0 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 1 |
| 1 | tt0118589 | movie | Glitter | 0.0 | 2001 | 104 | Drama,Music,Romance | nm0193554 | 2.4 | 23686.0 | 1 | 0 | 0 | 1 | 1 | 1 | 0 | 1 | 0 | 0 | 1 |
| 2 | tt0118652 | movie | The Attic Expeditions | 0.0 | 2001 | 100 | Comedy,Horror,Mystery | nm0440948 | 4.9 | 1850.0 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 1 |
| 3 | tt0121164 | movie | Corpse Bride | 0.0 | 2005 | 77 | Animation,Drama,Family | nm0000318,nm0425843 | 7.3 | 279860.0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 |
| 4 | tt0121389 | short | I, an Actress | 0.0 | 2005 | 10 | Short | nm0473647 | 6.0 | 343.0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
imdb_grouped[imdb_grouped['originalTitle'].duplicated()]
| tconst | titleType | originalTitle | isAdult | startYear | runtimeMinutes | genres | directors | averageRating | numVotes | actress | archive_footage | archive_sound | cinematographer | composer | director | editor | producer | production_designer | self | writer | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 986 | tt0292190 | videoGame | Planet of the Apes | 0.0 | 2001 | \N | Action,Adventure,Sci-Fi | \N | 5.3 | 100.0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 |
| 1013 | tt0292508 | movie | Fidel | 0.0 | 2001 | 91 | Biography,Documentary | nm0106073 | 7.5 | 558.0 | 0 | 0 | 0 | 2 | 2 | 2 | 0 | 2 | 0 | 2 | 0 |
| 1300 | tt0303354 | movie | Killing Time | 0.0 | 2002 | 81 | \N | nm1097510 | 5.9 | 60.0 | 1 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 0 | 0 |
| 1412 | tt0303938 | movie | Dust | 0.0 | 2001 | 90 | Comedy,Horror,Thriller | nm0556573 | 4.6 | 98.0 | 1 | 0 | 0 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 |
| 1521 | tt0315355 | movie | The Calling | 0.0 | 2002 | 117 | Drama,Family | nm1161445 | 7.2 | 13.0 | 1 | 0 | 0 | 1 | 1 | 1 | 0 | 1 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 252737 | tt9913936 | movie | Paradise | 0.0 | 2019 | 135 | Crime,Drama | nm10091596 | 7.5 | 59.0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 252742 | tt9914642 | movie | Albatross | 0.0 | 2017 | 97 | Documentary | nm5300859 | 8.3 | 42.0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 1 |
| 252748 | tt9915138 | tvSeries | Scandalous | 0.0 | 2018 | 60 | Crime,Documentary | nm7710846 | 6.6 | 11.0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 252753 | tt9916038 | short | Eco | 0.0 | 2019 | 14 | Drama,Short | nm1517212 | 6.4 | 15.0 | 1 | 0 | 0 | 1 | 1 | 1 | 0 | 1 | 0 | 0 | 1 |
| 252760 | tt9916460 | tvMovie | Pink Taxi | 0.0 | 2019 | \N | Comedy | nm7048843 | 9.4 | 18.0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
23241 rows × 21 columns
imdb_grouped[imdb_grouped['originalTitle'] == 'Planet of the Apes']
| tconst | titleType | originalTitle | isAdult | startYear | runtimeMinutes | genres | directors | averageRating | numVotes | actress | archive_footage | archive_sound | cinematographer | composer | director | editor | producer | production_designer | self | writer | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 10 | tt0133152 | movie | Planet of the Apes | 0.0 | 2001 | 119 | Action,Adventure,Sci-Fi | nm0000318 | 5.7 | 221998.0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 |
| 986 | tt0292190 | videoGame | Planet of the Apes | 0.0 | 2001 | \N | Action,Adventure,Sci-Fi | \N | 5.3 | 100.0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 |
Now that we got to the point of one record per movie, we can explore ways of working with genres, directors, and titleType, which are categorical columns.
genres¶Here, I'll create a numeric mapping for genres and split the genres into up to three columns (genre1, genre2, genre3), each representing a unique numeric value for a genre. If a movie had less than three genres, the remaining columns were filled with 0. If it has more, I will just use the first 3.
# Split the genres string by commas and explode the resulting list into individual rows
genres_split = imdb_grouped['genres'].str.split(',')
# Flatten the list of genres and get unique values
unique_genres = pd.Series([genre.strip() for sublist in genres_split for genre in sublist]).unique()
# Display the unique genres
print(unique_genres)
['Action' 'Crime' 'Drama' 'Music' 'Romance' 'Comedy' 'Horror' 'Mystery' 'Animation' 'Family' 'Short' 'Adventure' 'Fantasy' 'Thriller' 'Sci-Fi' 'Documentary' '\\N' 'Biography' 'Sport' 'History' 'War' 'Western' 'Musical' 'Talk-Show' 'Adult' 'Reality-TV' 'Game-Show' 'News']
# Create a mapping of genres to numeric values
genre_mapping = {
'Null': 0, 'Action': 1, 'Crime': 2, 'Drama': 3, 'Music': 4, 'Romance': 5, 'Comedy': 6,
'Horror': 7, 'Mystery': 8, 'Animation': 9, 'Family': 10, 'Short': 11, 'Adventure': 12,
'Fantasy': 13, 'Thriller': 14, 'Sci-Fi': 15, 'Documentary': 16, 'Biography': 17, 'Sport': 18,
'History': 19, 'War': 20, 'Western': 21, 'Musical': 22, 'Talk-Show': 23, 'Adult': 24,
'Reality-TV': 25, 'Game-Show': 26, 'News': 27, '\\N': 0
}
# Create a function to split genres and assign numeric values to new columns
def assign_genres(genres_list):
# Split the genres and map to numbers
mapped_genres = [genre_mapping[genre.strip()] for genre in genres_list if genre in genre_mapping]
# Assign up to 3 genres (fill with 0 if there are fewer than 3 genres)
return mapped_genres[:3] + [0] * (3 - len(mapped_genres))
# Apply the function to create new genre columns
imdb_grouped[['genre1', 'genre2', 'genre3']] = imdb_grouped['genres'].str.split(',').apply(assign_genres).apply(pd.Series)
# Display the transformed data
imdb_grouped[['genres', 'genre1', 'genre2', 'genre3']]
| genres | genre1 | genre2 | genre3 | |
|---|---|---|---|---|
| 0 | Action,Crime,Drama | 1 | 2 | 3 |
| 1 | Drama,Music,Romance | 3 | 4 | 5 |
| 2 | Comedy,Horror,Mystery | 6 | 7 | 8 |
| 3 | Animation,Drama,Family | 9 | 3 | 10 |
| 4 | Short | 11 | 0 | 0 |
| ... | ... | ... | ... | ... |
| 252758 | Thriller | 14 | 0 | 0 |
| 252759 | Drama,History | 3 | 19 | 0 |
| 252760 | Comedy | 6 | 0 | 0 |
| 252761 | Drama | 3 | 0 | 0 |
| 252762 | \N | 0 | 0 | 0 |
252763 rows × 4 columns
directors¶Here, I will split the directors column by removing the 'nm' prefix and create two columns (director1, director2). If only one director was listed, the second column was set to 0. Missing values ('\N') were handled by assigning 0.
# Create a function to process the directors column
def assign_directors(directors_list):
# Remove 'nm' prefix and split by commas
processed_directors = [int(director.strip().replace('nm', '')) if director != '\\N' else 0 for director in directors_list.split(',')]
# Assign up to 2 directors and fill with 0 if there is only one director
return processed_directors[:2] + [0] * (2 - len(processed_directors))
# Apply the function to create new director columns
imdb_grouped[['director1', 'director2']] = imdb_grouped['directors'].apply(assign_directors).apply(pd.Series).astype(int)
# Display the transformed data
imdb_grouped[['directors', 'director1', 'director2']].head()
| directors | director1 | director2 | |
|---|---|---|---|
| 0 | nm13648797 | 13648797 | 0 |
| 1 | nm0193554 | 193554 | 0 |
| 2 | nm0440948 | 440948 | 0 |
| 3 | nm0000318,nm0425843 | 318 | 425843 |
| 4 | nm0473647 | 473647 | 0 |
titleType¶Here, I will map each unique value in the titleType column to a corresponding numeric value using a dictionary, and create a new column (titleType_encoded) for this numeric column.
imdb_grouped['titleType'].unique()
array(['movie', 'short', 'video', 'tvMovie', 'tvMiniSeries', 'tvSeries',
'videoGame', 'tvSpecial', 'tvShort', 'tvEpisode'], dtype=object)
# Create a mapping of titleType to numeric values
titleType_mapping = {
'movie': 1,
'short': 2,
'video': 3,
'tvMovie': 4,
'tvMiniSeries': 5,
'tvSeries': 6,
'videoGame': 7,
'tvSpecial': 8,
'tvShort': 9,
'tvEpisode': 10
}
# Apply the mapping to the 'titleType' column
imdb_grouped['titleType_encoded'] = imdb_grouped['titleType'].map(titleType_mapping)
# Display the transformed data
imdb_grouped[['titleType', 'titleType_encoded']].head()
| titleType | titleType_encoded | |
|---|---|---|
| 0 | movie | 1 |
| 1 | movie | 1 |
| 2 | movie | 1 |
| 3 | movie | 1 |
| 4 | short | 2 |
runtimeMinutes¶Here, I will fill null values in runtimeMinutes with the average runtime per title in titleType column.
imdb_grouped['runtimeMinutes'].value_counts()
runtimeMinutes
\N 43683
90 9680
60 6489
30 5156
15 4832
...
311 1
413 1
624 1
529 1
462 1
Name: count, Length: 546, dtype: int64
# Get average runtime for each title type
imdb_grouped['runtimeMinutes'] = np.where(imdb_grouped['runtimeMinutes'] == '\\N', np.nan, imdb_grouped['runtimeMinutes']).astype(float)
avg_runtime = imdb_grouped.dropna()
avg_runtime = avg_runtime.groupby(['titleType']).agg(runtimeMinutes_avg=('runtimeMinutes', 'mean')).reset_index()
avg_runtime.head()
| titleType | runtimeMinutes_avg | |
|---|---|---|
| 0 | movie | 95.903562 |
| 1 | short | 15.353863 |
| 2 | tvEpisode | 68.024096 |
| 3 | tvMiniSeries | 105.786276 |
| 4 | tvMovie | 80.757631 |
# Filling Null values
imdb_merged = imdb_grouped.merge(avg_runtime, on=['titleType'], how='left')
imdb_merged['runtimeMinutes'] = np.where(imdb_merged['runtimeMinutes'].isna(), imdb_merged['runtimeMinutes_avg'], imdb_merged['runtimeMinutes'])
imdb_merged[imdb_merged['runtimeMinutes'].isna()]
| tconst | titleType | originalTitle | isAdult | startYear | runtimeMinutes | genres | directors | averageRating | numVotes | actress | archive_footage | archive_sound | cinematographer | composer | director | editor | producer | production_designer | self | writer | genre1 | genre2 | genre3 | director1 | director2 | titleType_encoded | runtimeMinutes_avg |
|---|
imdb_merged
| tconst | titleType | originalTitle | isAdult | startYear | runtimeMinutes | genres | directors | averageRating | numVotes | actress | archive_footage | archive_sound | cinematographer | composer | director | editor | producer | production_designer | self | writer | genre1 | genre2 | genre3 | director1 | director2 | titleType_encoded | runtimeMinutes_avg | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | tt0111596 | movie | Vazir | 0.0 | 2015 | 95.903562 | Action,Crime,Drama | nm13648797 | 8.1 | 63.0 | 1 | 0 | 0 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 1 | 1 | 2 | 3 | 13648797 | 0 | 1 | 95.903562 |
| 1 | tt0118589 | movie | Glitter | 0.0 | 2001 | 104.000000 | Drama,Music,Romance | nm0193554 | 2.4 | 23686.0 | 1 | 0 | 0 | 1 | 1 | 1 | 0 | 1 | 0 | 0 | 1 | 3 | 4 | 5 | 193554 | 0 | 1 | 95.903562 |
| 2 | tt0118652 | movie | The Attic Expeditions | 0.0 | 2001 | 100.000000 | Comedy,Horror,Mystery | nm0440948 | 4.9 | 1850.0 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 1 | 6 | 7 | 8 | 440948 | 0 | 1 | 95.903562 |
| 3 | tt0121164 | movie | Corpse Bride | 0.0 | 2005 | 77.000000 | Animation,Drama,Family | nm0000318,nm0425843 | 7.3 | 279860.0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 9 | 3 | 10 | 318 | 425843 | 1 | 95.903562 |
| 4 | tt0121389 | short | I, an Actress | 0.0 | 2005 | 10.000000 | Short | nm0473647 | 6.0 | 343.0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 11 | 0 | 0 | 473647 | 0 | 2 | 15.353863 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 252758 | tt9916270 | movie | Il talento del calabrone | 0.0 | 2020 | 84.000000 | Thriller | nm1480867 | 5.8 | 1393.0 | 1 | 0 | 0 | 1 | 1 | 1 | 0 | 1 | 0 | 0 | 1 | 14 | 0 | 0 | 1480867 | 0 | 1 | 95.903562 |
| 252759 | tt9916362 | movie | Akelarre | 0.0 | 2020 | 92.000000 | Drama,History | nm1893148 | 6.4 | 5179.0 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 1 | 3 | 19 | 0 | 1893148 | 0 | 1 | 95.903562 |
| 252760 | tt9916460 | tvMovie | Pink Taxi | 0.0 | 2019 | 80.757631 | Comedy | nm7048843 | 9.4 | 18.0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 6 | 0 | 0 | 7048843 | 0 | 4 | 80.757631 |
| 252761 | tt9916538 | movie | Kuambil Lagi Hatiku | 0.0 | 2019 | 123.000000 | Drama | nm4457074 | 8.3 | 6.0 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 1 | 3 | 0 | 0 | 4457074 | 0 | 1 | 95.903562 |
| 252762 | tt9916730 | movie | 6 Gunn | 0.0 | 2017 | 116.000000 | \N | nm10538612 | 8.1 | 9.0 | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 10538612 | 0 | 1 | 95.903562 |
252763 rows × 28 columns
numVotes¶No null values found.
imdb_merged[imdb_merged['numVotes'] == '\\N']
| tconst | titleType | originalTitle | isAdult | startYear | runtimeMinutes | genres | directors | averageRating | numVotes | actress | archive_footage | archive_sound | cinematographer | composer | director | editor | producer | production_designer | self | writer | genre1 | genre2 | genre3 | director1 | director2 | titleType_encoded | runtimeMinutes_avg |
|---|
imdb_merged[imdb_merged['numVotes'].isna()]
| tconst | titleType | originalTitle | isAdult | startYear | runtimeMinutes | genres | directors | averageRating | numVotes | actress | archive_footage | archive_sound | cinematographer | composer | director | editor | producer | production_designer | self | writer | genre1 | genre2 | genre3 | director1 | director2 | titleType_encoded | runtimeMinutes_avg |
|---|
startYear¶No null values found.
imdb_merged[imdb_merged['startYear'].isna()]
| tconst | titleType | originalTitle | isAdult | startYear | runtimeMinutes | genres | directors | averageRating | numVotes | actress | archive_footage | archive_sound | cinematographer | composer | director | editor | producer | production_designer | self | writer | genre1 | genre2 | genre3 | director1 | director2 | titleType_encoded | runtimeMinutes_avg |
|---|
imdb_merged['startYear'].value_counts()
startYear 2016 16411 2017 15878 2015 15711 2013 15323 2014 15120 2012 15035 2011 14496 2018 13794 2010 13454 2019 13097 2021 12870 2009 12868 2020 12750 2008 11462 2022 11308 2007 10088 2006 9558 2005 7534 2004 5409 2003 4295 2002 3075 2001 2457 2023 770 Name: count, dtype: int64
rating column derived from averageRating¶To classify movies into 5 star-based rating categories, I applied quantile-based binning to the averageRating column since averageRating is a continuous variable amd cannot be used directly in a classification model. This approach divides the data into five groups of similar size, with each bin representing about 20% of the ratings. By using quantiles, the process ensures that each category contains a similar number of movies, resulting in a balanced distribution. This approach may improve the model’s performance by preventing bias toward overrepresented ratings and makes the data suitable for classification.
# Analyze the distribution of averageRating
imdb_merged['averageRating'].describe()
count 252763.000000 mean 6.652431 std 1.518086 min 1.000000 25% 5.800000 50% 6.800000 75% 7.700000 max 10.000000 Name: averageRating, dtype: float64
# Create a new rating column using quantile-based binning
imdb_merged['rating'], bin_edges = pd.qcut(imdb_merged['averageRating'], q=5, labels=[1, 2, 3, 4, 5], retbins=True)
imdb_merged['rating'] = imdb_merged['rating'].astype(int)
# Display the bin edges (ranges)
print("Stars average ratings range: ")
for i in range(1, len(bin_edges)):
print(f"{i} star: {bin_edges[i-1]:.2f} to {bin_edges[i]:.2f}")
print("")
# Check the distribution of the new rating column
print("New rating distribution: ")
print(imdb_merged['rating'].value_counts())
Stars average ratings range: 1 star: 1.00 to 5.50 2 star: 5.50 to 6.40 3 star: 6.40 to 7.10 4 star: 7.10 to 7.90 5 star: 7.90 to 10.00 New rating distribution: rating 1 53403 4 50778 5 50101 3 49547 2 48934 Name: count, dtype: int64
imdb_merged.head()
| tconst | titleType | originalTitle | isAdult | startYear | runtimeMinutes | genres | directors | averageRating | numVotes | actress | archive_footage | archive_sound | cinematographer | composer | director | editor | producer | production_designer | self | writer | genre1 | genre2 | genre3 | director1 | director2 | titleType_encoded | runtimeMinutes_avg | rating | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | tt0111596 | movie | Vazir | 0.0 | 2015 | 95.903562 | Action,Crime,Drama | nm13648797 | 8.1 | 63.0 | 1 | 0 | 0 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 1 | 1 | 2 | 3 | 13648797 | 0 | 1 | 95.903562 | 5 |
| 1 | tt0118589 | movie | Glitter | 0.0 | 2001 | 104.000000 | Drama,Music,Romance | nm0193554 | 2.4 | 23686.0 | 1 | 0 | 0 | 1 | 1 | 1 | 0 | 1 | 0 | 0 | 1 | 3 | 4 | 5 | 193554 | 0 | 1 | 95.903562 | 1 |
| 2 | tt0118652 | movie | The Attic Expeditions | 0.0 | 2001 | 100.000000 | Comedy,Horror,Mystery | nm0440948 | 4.9 | 1850.0 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 1 | 6 | 7 | 8 | 440948 | 0 | 1 | 95.903562 | 1 |
| 3 | tt0121164 | movie | Corpse Bride | 0.0 | 2005 | 77.000000 | Animation,Drama,Family | nm0000318,nm0425843 | 7.3 | 279860.0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 9 | 3 | 10 | 318 | 425843 | 1 | 95.903562 | 4 |
| 4 | tt0121389 | short | I, an Actress | 0.0 | 2005 | 10.000000 | Short | nm0473647 | 6.0 | 343.0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 11 | 0 | 0 | 473647 | 0 | 2 | 15.353863 | 2 |
# Removing unecessary columns
imdb_final = imdb_merged.drop(columns=['tconst', 'originalTitle', 'titleType', 'genres', 'directors', 'runtimeMinutes_avg'])
imdb_final
| isAdult | startYear | runtimeMinutes | averageRating | numVotes | actress | archive_footage | archive_sound | cinematographer | composer | director | editor | producer | production_designer | self | writer | genre1 | genre2 | genre3 | director1 | director2 | titleType_encoded | rating | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 2015 | 95.903562 | 8.1 | 63.0 | 1 | 0 | 0 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 1 | 1 | 2 | 3 | 13648797 | 0 | 1 | 5 |
| 1 | 0.0 | 2001 | 104.000000 | 2.4 | 23686.0 | 1 | 0 | 0 | 1 | 1 | 1 | 0 | 1 | 0 | 0 | 1 | 3 | 4 | 5 | 193554 | 0 | 1 | 1 |
| 2 | 0.0 | 2001 | 100.000000 | 4.9 | 1850.0 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 1 | 6 | 7 | 8 | 440948 | 0 | 1 | 1 |
| 3 | 0.0 | 2005 | 77.000000 | 7.3 | 279860.0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 9 | 3 | 10 | 318 | 425843 | 1 | 4 |
| 4 | 0.0 | 2005 | 10.000000 | 6.0 | 343.0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 11 | 0 | 0 | 473647 | 0 | 2 | 2 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 252758 | 0.0 | 2020 | 84.000000 | 5.8 | 1393.0 | 1 | 0 | 0 | 1 | 1 | 1 | 0 | 1 | 0 | 0 | 1 | 14 | 0 | 0 | 1480867 | 0 | 1 | 2 |
| 252759 | 0.0 | 2020 | 92.000000 | 6.4 | 5179.0 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 1 | 3 | 19 | 0 | 1893148 | 0 | 1 | 2 |
| 252760 | 0.0 | 2019 | 80.757631 | 9.4 | 18.0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 6 | 0 | 0 | 7048843 | 0 | 4 | 5 |
| 252761 | 0.0 | 2019 | 123.000000 | 8.3 | 6.0 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 1 | 3 | 0 | 0 | 4457074 | 0 | 1 | 5 |
| 252762 | 0.0 | 2017 | 116.000000 | 8.1 | 9.0 | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 10538612 | 0 | 1 | 5 |
252763 rows × 23 columns
imdb_final.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 252763 entries, 0 to 252762 Data columns (total 23 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 isAdult 252763 non-null float64 1 startYear 252763 non-null int64 2 runtimeMinutes 252763 non-null float64 3 averageRating 252763 non-null float64 4 numVotes 252763 non-null float64 5 actress 252763 non-null int64 6 archive_footage 252763 non-null int64 7 archive_sound 252763 non-null int64 8 cinematographer 252763 non-null int64 9 composer 252763 non-null int64 10 director 252763 non-null int64 11 editor 252763 non-null int64 12 producer 252763 non-null int64 13 production_designer 252763 non-null int64 14 self 252763 non-null int64 15 writer 252763 non-null int64 16 genre1 252763 non-null int64 17 genre2 252763 non-null int64 18 genre3 252763 non-null int64 19 director1 252763 non-null int64 20 director2 252763 non-null int64 21 titleType_encoded 252763 non-null int64 22 rating 252763 non-null int64 dtypes: float64(4), int64(19) memory usage: 44.4 MB
file_name = "newimdb_clean.csv"
imdb_final.to_csv(data_loc + file_name, index=False)
You will make regression models here using the cleaned data from problem 1. Use averageRating as the target variable for this problem.
Begin by selecting five advanced machine learning models for multiple linear/nonlinear regression, excluding basic models like OLS, Decision Tree, Random Forest, K-Nearest Neighbors, and SVM. Determine appropriate test size and validation method, providing a clear rationale for your choices. Split your data into training and testing sets based on these decisions. Use 'averageRating' as your target variable and preprocess your feature set as required by your selected models.
Implement each of your five chosen regression models using the prepared data. Train each model on the training set and generate predictions for the test set. Create a dataframe that combines the predictions from all five models for the test data, and display the first few rows of this dataframe.
Evaluate the performance of all five models by calculating the Root Mean Square Error (RMSE) and R-squared values for both the training and testing data. Present these metrics in a clear, tabular format for easy comparison. Based on these results, select the best-performing model and provide a detailed explanation for your choice. Finally, critically assess whether your best model is suitable for production use. Discuss its strengths and limitations, and propose specific strategies or techniques you would employ to further improve the model's performance.
# Standards
import pandas as pd
pd.set_option('display.max_columns', None)
import numpy as np
import warnings
warnings.filterwarnings('ignore')
import matplotlib.pyplot as plt
# Data prep for modeling
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
# Modeling
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.inspection import permutation_importance
from sklearn.neural_network import MLPRegressor
from lightgbm import LGBMRegressor
from xgboost import XGBRegressor, plot_importance
from catboost import CatBoostRegressor
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import Adam
# Evaluation
from sklearn.metrics import mean_squared_error, r2_score, make_scorer
2024-09-28 22:15:53.779991: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations. To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
# Data location
data_loc = "/Users/gabrielvictorgomesferreira/Library/Mobile Documents/com~apple~CloudDocs/Work/ISU Classes/MIS 546 - Advanced Business Analytics/Data/"
file_name = "newimdb_clean.csv"
# Import dataset
clean_df = pd.read_csv(data_loc + file_name)
clean_df.head()
| isAdult | startYear | runtimeMinutes | averageRating | numVotes | actress | archive_footage | archive_sound | cinematographer | composer | director | editor | producer | production_designer | self | writer | genre1 | genre2 | genre3 | director1 | director2 | titleType_encoded | rating | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 2015 | 95.903562 | 8.1 | 63.0 | 1 | 0 | 0 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 1 | 1 | 2 | 3 | 13648797 | 0 | 1 | 5 |
| 1 | 0.0 | 2001 | 104.000000 | 2.4 | 23686.0 | 1 | 0 | 0 | 1 | 1 | 1 | 0 | 1 | 0 | 0 | 1 | 3 | 4 | 5 | 193554 | 0 | 1 | 1 |
| 2 | 0.0 | 2001 | 100.000000 | 4.9 | 1850.0 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 1 | 6 | 7 | 8 | 440948 | 0 | 1 | 1 |
| 3 | 0.0 | 2005 | 77.000000 | 7.3 | 279860.0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 9 | 3 | 10 | 318 | 425843 | 1 | 4 |
| 4 | 0.0 | 2005 | 10.000000 | 6.0 | 343.0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 11 | 0 | 0 | 473647 | 0 | 2 | 2 |
# Predictor Features
X = clean_df.drop(columns=['averageRating', 'rating'])
# Tagert Feature
y = clean_df['averageRating']
# Applying SelectKBest to extract top 19 features
bestfeatures = SelectKBest(score_func=f_regression, k=19)
KfeatureScores = pd.DataFrame({'Features': X.columns, 'Score': bestfeatures.fit(X, y).scores_}).sort_values(by=['Score'], ascending=False)
# Printing top 19 features with the highest scores
print(KfeatureScores.nlargest(19, 'Score'))
Features Score 4 actress 6319.151088 18 director1 4530.070883 20 titleType_encoded 4413.634142 14 writer 4155.665752 13 self 3626.197880 11 producer 2622.132965 15 genre1 2302.100806 9 director 2082.130535 8 composer 1329.835507 16 genre2 985.801154 19 director2 931.539014 2 runtimeMinutes 463.657343 5 archive_footage 322.470430 1 startYear 262.592681 7 cinematographer 155.744928 3 numVotes 148.320284 12 production_designer 58.444015 10 editor 30.329249 0 isAdult 28.167798
# Split data into train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
I selected an 80/20 train-test split for the data, where 80% is used for training and 20% for testing. This provides a good balance, ensuring that the model has enough data to learn from while having also enough data for an unbiased evaluation of performance on unseen samples.
For model validation, I chose 5-Fold Cross-Validation on the training set. This method divides the training data into 5 subsets, where the model is trained on 4 subsets and validated on the remaining one. This process is repeated 5 times, in which each subset is treated as a validation set at least once. At the end, the average results across all folds helps on the assessment of the model’s generalization capability, which may reduce variance and mitigate the risk of overfitting.
# Building the Gradient Boosting model
gb_feature_model = GradientBoostingRegressor(n_estimators=100, learning_rate=0.1, random_state=42)
gb_feature_model.fit(X_train, y_train)
# Extract feature importance and select top 10 features
gb_feature_importance_df = pd.DataFrame({
'Feature': X_train.columns,
'Importance': gb_feature_model.feature_importances_
})
gb_top_10_features = gb_feature_importance_df.sort_values(by='Importance', ascending=False).reset_index(drop=True).head(10)
# Print the top 10 feature names and their importance scores
print("Top 10 Features by Importance (Gradient Boosting):")
print(gb_top_10_features)
# Visualize the top 10 features
plt.figure(figsize=(10, 6))
plt.barh(gb_top_10_features['Feature'], gb_top_10_features['Importance'], color='skyblue')
plt.xlabel("Importance Score")
plt.ylabel("Features")
plt.title("Top 10 Feature Importance (Gradient Boosting Regressor)")
plt.gca().invert_yaxis()
plt.show()
Top 10 Features by Importance (Gradient Boosting):
Feature Importance
0 genre1 0.272489
1 runtimeMinutes 0.256218
2 numVotes 0.250386
3 titleType_encoded 0.092470
4 genre2 0.038440
5 actress 0.026522
6 director1 0.023518
7 startYear 0.019659
8 self 0.015823
9 genre3 0.001904
# Building the Gradient Boosting model
gb_model = GradientBoostingRegressor(n_estimators=100, learning_rate=0.1, random_state=42)
gb_model.fit(X_train, y_train)
# Making predictions on the train and test set
gb_y_train_pred = gb_model.predict(X_train)
gb_y_test_pred = gb_model.predict(X_test)
# Evaluating the model
gb_mse_train = mean_squared_error(y_train, gb_y_train_pred)
gb_mse_test = mean_squared_error(y_test, gb_y_test_pred)
gb_r2_train = r2_score(y_train, gb_y_train_pred)
gb_r2_test = r2_score(y_test, gb_y_test_pred)
print(f'Mean Squared Error -- Train: {gb_mse_train} | Test: {gb_mse_test}')
print(f'R-squared Score -- Train: {gb_r2_train} | Test: {gb_r2_test}')
Mean Squared Error -- Train: 1.6899126696987556 | Test: 1.7160659301327499 R-squared Score -- Train: 0.2661201788830251 | Test: 0.2577004038269062
# select top features
gb_best_features = gb_top_10_features['Feature'][0:4]
X_train_gb = X_train[gb_best_features]
X_test_gb = X_test[gb_best_features]
# Building the Gradient Boosting model
gb_model = GradientBoostingRegressor(n_estimators=100, learning_rate=0.1, random_state=42)
# Define RMSE as a scoring function
rmse_scorer = make_scorer(mean_squared_error, greater_is_better=False, squared=False) # squared=False returns RMSE directly
## Cross Validation
# Evaluate using RMSE
gb_rmse_scores = cross_val_score(gb_model, X_train_gb, y_train, cv=5, scoring=rmse_scorer)
gb_average_rmse = -gb_rmse_scores.mean()
print(f"\nCross-Validation (5-Fold) - Gradient Boosting")
print(f"{'Metric':<30} {'Fold Scores':<50} {'Mean Score':<10}")
print(f"{'-'*95}")
print(f"RMSE Scores: {'':<12} {np.round(-gb_rmse_scores, 4)} {'':<18} {gb_average_rmse:.4f}")
# Evaluate using r2
r2_scores = cross_val_score(gb_model, X_train_gb, y_train, cv=5, scoring='r2')
average_r2 = r2_scores.mean()
print(f"R² Scores: {'':<14} {np.round(r2_scores, 4)} {'':<18} {average_r2:.4f}")
print('-' * 95)
Cross-Validation (5-Fold) - Gradient Boosting Metric Fold Scores Mean Score ----------------------------------------------------------------------------------------------- RMSE Scores: [1.3254 1.3269 1.3204 1.3101 1.3126] 1.3191 R² Scores: [0.2435 0.2422 0.2428 0.252 0.2413] 0.2444 -----------------------------------------------------------------------------------------------
## Training with GB top features
# Fit the model on the entire training data
gb_model.fit(X_train_gb, y_train)
# Making predictions on the train and test set
gb_y_train_pred = gb_model.predict(X_train_gb)
gb_y_test_pred = gb_model.predict(X_test_gb)
# Evaluating the model
gb_rmse_train = np.sqrt(mean_squared_error(y_train, gb_y_train_pred))
gb_rmse_test = np.sqrt(mean_squared_error(y_test, gb_y_test_pred))
gb_r2_train = r2_score(y_train, gb_y_train_pred)
gb_r2_test = r2_score(y_test, gb_y_test_pred)
print(f"\nFinal Model Performance - Gradient Boosting")
print(f"{'Metric':<30} {'Train Score':<20} {'Test Score':<20}")
print(f"{'-'*60}")
print(f"RMSE (Root Mean Squared Error): {'':<3} {gb_rmse_train:.4f} {'':<10} {gb_rmse_test:.4f}")
print(f"R² (R-squared): {'':<19} {gb_r2_train:.4f} {'':<10} {gb_r2_test:.4f}")
Final Model Performance - Gradient Boosting Metric Train Score Test Score ------------------------------------------------------------ RMSE (Root Mean Squared Error): 1.3166 1.3252 R² (R-squared): 0.2472 0.2404
# Building the LGBMRegressor model
lgbm_feature_model = LGBMRegressor(n_estimators=100, learning_rate=0.1, random_state=42, verbose=-1)
lgbm_feature_model.fit(X_train, y_train)
# Extract feature importance and select top 10 features
lgbm_feature_importance_df = pd.DataFrame({
'Feature': X_train.columns,
'Importance': lgbm_feature_model.feature_importances_
})
lgbm_top_10_features = lgbm_feature_importance_df.sort_values(by='Importance', ascending=False).reset_index(drop=True).head(10)
# Print the top 10 feature names and their importance scores
print("Top 10 Features by Importance (LGBMRegressor):")
print(lgbm_top_10_features)
# Visualize the top 10 features
plt.figure(figsize=(10, 6))
plt.barh(lgbm_top_10_features['Feature'], lgbm_top_10_features['Importance'], color='skyblue')
plt.xlabel("Importance Score")
plt.ylabel("Features")
plt.title("Top 10 Feature Importance (LightGBM)")
plt.gca().invert_yaxis()
plt.show()
Top 10 Features by Importance (LGBMRegressor):
Feature Importance
0 genre1 572
1 runtimeMinutes 493
2 numVotes 414
3 genre2 293
4 director1 253
5 startYear 244
6 titleType_encoded 216
7 genre3 150
8 director2 94
9 self 52
# select top features
lgbm_best_features = lgbm_top_10_features['Feature'][:]
X_train_lgbm = X_train[lgbm_best_features]
X_test_lgbm = X_test[lgbm_best_features]
# Building the LGBMRegressor model
lgbm_model = LGBMRegressor(n_estimators=100, learning_rate=0.1, random_state=42, verbose=-1)
# Define RMSE as a scoring function
rmse_scorer = make_scorer(mean_squared_error, greater_is_better=False, squared=False) # squared=False returns RMSE directly
## Cross Validation
# Evaluate using RMSE
lgbm_rmse_scores = cross_val_score(lgbm_model, X_train_lgbm, y_train, cv=5, scoring=rmse_scorer)
lgbm_average_rmse = -lgbm_rmse_scores.mean()
print(f"\nCross-Validation (5-Fold) - LightGBM Regressor")
print(f"{'Metric':<30} {'Fold Scores':<50} {'Mean Score':<10}")
print(f"{'-'*95}")
print(f"RMSE Scores: {'':<12} {np.round(-lgbm_rmse_scores, 4)} {'':<18} {lgbm_average_rmse:.4f}")
# Evaluate using r2
r2_scores = cross_val_score(lgbm_model, X_train_lgbm, y_train, cv=5, scoring='r2')
average_r2 = r2_scores.mean()
print(f"R² Scores: {'':<14} {np.round(r2_scores, 4)} {'':<18} {average_r2:.4f}")
print('-' * 95)
Cross-Validation (5-Fold) - LightGBM Regressor Metric Fold Scores Mean Score ----------------------------------------------------------------------------------------------- RMSE Scores: [1.2791 1.2828 1.2766 1.2657 1.2684] 1.2745 R² Scores: [0.2955 0.2918 0.2922 0.3019 0.2915] 0.2946 -----------------------------------------------------------------------------------------------
## Training with LGBM top features
# Fit the model on the entire training data
lgbm_model.fit(X_train_lgbm, y_train)
# Making predictions on the train and test set
lgbm_y_train_pred = lgbm_model.predict(X_train_lgbm)
lgbm_y_test_pred = lgbm_model.predict(X_test_lgbm)
# Evaluating the model
lgbm_rmse_train = np.sqrt(mean_squared_error(y_train, lgbm_y_train_pred))
lgbm_rmse_test = np.sqrt(mean_squared_error(y_test, lgbm_y_test_pred))
lgbm_r2_train = r2_score(y_train, lgbm_y_train_pred)
lgbm_r2_test = r2_score(y_test, lgbm_y_test_pred)
print(f"\nFinal Model Performance - LightGBM Regressor")
print(f"{'Metric':<30} {'Train Score':<20} {'Test Score':<20}")
print(f"{'-'*60}")
print(f"RMSE (Root Mean Squared Error): {'':<3} {lgbm_rmse_train:.4f} {'':<10} {lgbm_rmse_test:.4f}")
print(f"R² (R-squared): {'':<19} {lgbm_r2_train:.4f} {'':<10} {lgbm_r2_test:.4f}")
Final Model Performance - LightGBM Regressor Metric Train Score Test Score ------------------------------------------------------------ RMSE (Root Mean Squared Error): 1.2626 1.2810 R² (R-squared): 0.3077 0.2902
# Building the XGBRegressor model
xgb_feature_model = XGBRegressor(n_estimators=100, learning_rate=0.1, random_state=42)
xgb_feature_model.fit(X_train, y_train)
# Extract feature importance using get_booster().get_score()
booster = xgb_feature_model.get_booster()
importance_dict = booster.get_score(importance_type='gain')
# Convert the importance dictionary to a DataFrame
feature_importance_df = pd.DataFrame({
'Feature': [f for f in importance_dict.keys()],
'Importance': [importance_dict[f] for f in importance_dict.keys()]
})
# Sort by importance and select top 10
xgb_top_10_features = feature_importance_df.sort_values(by='Importance', ascending=False).reset_index(drop=True).head(10)
# Print the top 10 feature names and their importance scores
print("Top 10 Features by Importance (XGBoost):")
print(xgb_top_10_features)
# Visualize the top 10 features
plt.figure(figsize=(10, 6))
plt.barh(xgb_top_10_features['Feature'], xgb_top_10_features['Importance'], color='skyblue')
plt.xlabel("Gain (Importance Score)")
plt.ylabel("Features")
plt.title("Top 10 Feature Importance (XGBoost)")
plt.gca().invert_yaxis()
plt.show()
Top 10 Features by Importance (XGBoost):
Feature Importance
0 genre1 271.680481
1 runtimeMinutes 230.020172
2 numVotes 177.048859
3 actress 140.087158
4 titleType_encoded 134.178329
5 self 101.458420
6 genre2 69.394676
7 startYear 38.411686
8 director1 33.888279
9 isAdult 28.239347
# select top features
xgb_best_features = xgb_top_10_features['Feature'][0:10]
X_train_xgb = X_train[xgb_best_features]
X_test_xgb = X_test[xgb_best_features]
# Building the XGBRegressor model
xgb_model = XGBRegressor(n_estimators=100, learning_rate=0.1, random_state=42)
# Define RMSE as a scoring function
rmse_scorer = make_scorer(mean_squared_error, greater_is_better=False, squared=False) # squared=False returns RMSE directly
## Cross Validation
# Evaluate using RMSE
xgb_rmse_scores = cross_val_score(xgb_model, X_train_xgb, y_train, cv=5, scoring=rmse_scorer)
xgb_average_rmse = -xgb_rmse_scores.mean()
print(f"\nCross-Validation (5-Fold) - XGBoost Regressor")
print(f"{'Metric':<30} {'Fold Scores':<50} {'Mean Score':<10}")
print(f"{'-'*95}")
print(f"RMSE Scores: {'':<12} {np.round(-xgb_rmse_scores, 4)} {'':<18} {xgb_average_rmse:.4f}")
# Evaluate using r2
r2_scores = cross_val_score(xgb_model, X_train_xgb, y_train, cv=5, scoring='r2')
average_r2 = r2_scores.mean()
print(f"R² Scores: {'':<14} {np.round(r2_scores, 4)} {'':<18} {average_r2:.4f}")
print('-' * 95)
Cross-Validation (5-Fold) - XGBoost Regressor Metric Fold Scores Mean Score ----------------------------------------------------------------------------------------------- RMSE Scores: [1.2781 1.2827 1.2758 1.2652 1.2683] 1.2740 R² Scores: [0.2966 0.2919 0.2931 0.3024 0.2916] 0.2951 -----------------------------------------------------------------------------------------------
## Training with XGB top features
# Fit the model on the entire training data
xgb_model.fit(X_train_xgb, y_train)
# Making predictions on the train and test set
xgb_y_train_pred = xgb_model.predict(X_train_xgb)
xgb_y_test_pred = xgb_model.predict(X_test_xgb)
# Evaluating the model
xgb_rmse_train = np.sqrt(mean_squared_error(y_train, xgb_y_train_pred))
xgb_rmse_test = np.sqrt(mean_squared_error(y_test, xgb_y_test_pred))
xgb_r2_train = r2_score(y_train, xgb_y_train_pred)
xgb_r2_test = r2_score(y_test, xgb_y_test_pred)
print(f"\nFinal Model Performance - XGBoost Regressor")
print(f"{'Metric':<30} {'Train Score':<20} {'Test Score':<20}")
print(f"{'-'*60}")
print(f"RMSE (Root Mean Squared Error): {'':<3} {xgb_rmse_train:.4f} {'':<10} {xgb_rmse_test:.4f}")
print(f"R² (R-squared): {'':<19} {xgb_r2_train:.4f} {'':<10} {xgb_r2_test:.4f}")
Final Model Performance - XGBoost Regressor Metric Train Score Test Score ------------------------------------------------------------ RMSE (Root Mean Squared Error): 1.2560 1.2801 R² (R-squared): 0.3149 0.2912
# Building the CatBoost Regressor model
cat_feature_model = CatBoostRegressor(n_estimators=100, learning_rate=0.1, random_state=42, verbose=0)
cat_feature_model.fit(X_train, y_train)
# Extract feature importance and select top 10 features
cat_feature_importance_df = pd.DataFrame({
'Feature': X_train.columns,
'Importance': cat_feature_model.feature_importances_
})
cat_top_10_features = cat_feature_importance_df.sort_values(by='Importance', ascending=False).reset_index(drop=True).head(10)
# Print the top 10 feature names and their importance scores
print("Top 10 Features by Importance (CatBoostRegressor):")
print(cat_top_10_features)
# Visualize the top 10 features
plt.figure(figsize=(10, 6))
plt.barh(cat_top_10_features['Feature'], cat_top_10_features['Importance'], color='skyblue')
plt.xlabel("Gain (Importance Score)")
plt.ylabel("Features")
plt.title("Top 10 Feature Importance (CatBoostRegressor)")
plt.gca().invert_yaxis()
plt.show()
Top 10 Features by Importance (CatBoostRegressor):
Feature Importance
0 genre1 29.374023
1 runtimeMinutes 22.236463
2 numVotes 21.318636
3 titleType_encoded 9.804616
4 genre2 5.253631
5 actress 2.909666
6 self 2.610866
7 director1 2.417334
8 startYear 2.315330
9 genre3 0.542612
# select top features
cat_best_features = cat_top_10_features['Feature'][:]
X_train_cat = X_train[cat_best_features]
X_test_cat = X_test[cat_best_features]
# Building the CatBoostRegressor model
cat_model = CatBoostRegressor(iterations=1000, learning_rate=0.1, depth=6, verbose=0, random_state=42)
cat_model.fit(X_train_cat, y_train)
# Define RMSE as a scoring function
rmse_scorer = make_scorer(mean_squared_error, greater_is_better=False, squared=False) # squared=False returns RMSE directly
## Cross Validation
# Evaluate using RMSE
cat_rmse_scores = cross_val_score(cat_model, X_train_cat, y_train, cv=5, scoring=rmse_scorer)
cat_average_rmse = -cat_rmse_scores.mean()
print(f"\nCross-Validation (5-Fold) - CatBoost Regressor")
print(f"{'Metric':<30} {'Fold Scores':<50} {'Mean Score':<10}")
print(f"{'-'*95}")
print(f"RMSE Scores: {'':<12} {np.round(-cat_rmse_scores, 4)} {'':<18} {cat_average_rmse:.4f}")
# Evaluate using r2
r2_scores = cross_val_score(cat_model, X_train_cat, y_train, cv=5, scoring='r2')
average_r2 = r2_scores.mean()
print(f"R² Scores: {'':<14} {np.round(r2_scores, 4)} {'':<18} {average_r2:.4f}")
print('-' * 95)
Cross-Validation (5-Fold) - CatBoost Regressor Metric Fold Scores Mean Score ----------------------------------------------------------------------------------------------- RMSE Scores: [1.2671 1.2718 1.263 1.2548 1.2562] 1.2626 R² Scores: [0.3086 0.3038 0.3072 0.3138 0.3051] 0.3077 -----------------------------------------------------------------------------------------------
## Training with CAT top features
# Fit the model on the entire training data
cat_model.fit(X_train_cat, y_train)
# Making predictions on the train and test set
cat_y_train_pred = cat_model.predict(X_train_cat)
cat_y_test_pred = cat_model.predict(X_test_cat)
# Evaluating the model
cat_rmse_train = np.sqrt(mean_squared_error(y_train, cat_y_train_pred))
cat_rmse_test = np.sqrt(mean_squared_error(y_test, cat_y_test_pred))
cat_r2_train = r2_score(y_train, cat_y_train_pred)
cat_r2_test = r2_score(y_test, cat_y_test_pred)
print(f"\nFinal Model Performance - CatBoost Regressor")
print(f"{'Metric':<30} {'Train Score':<20} {'Test Score':<20}")
print(f"{'-'*60}")
print(f"RMSE (Root Mean Squared Error): {'':<3} {cat_rmse_train:.4f} {'':<10} {cat_rmse_test:.4f}")
print(f"R² (R-squared): {'':<19} {cat_r2_train:.4f} {'':<10} {cat_r2_test:.4f}")
Final Model Performance - CatBoost Regressor Metric Train Score Test Score ------------------------------------------------------------ RMSE (Root Mean Squared Error): 1.2309 1.2697 R² (R-squared): 0.3420 0.3026
# Scaling Dataset
scaler = StandardScaler()
X_train_scaled = pd.DataFrame(scaler.fit_transform(X_train), columns=X_train.columns) # Convert back to DataFrame after scaling
X_test_scaled = pd.DataFrame(scaler.transform(X_test), columns=X_test.columns)
# Define the Neural Network model
nn_model = MLPRegressor(hidden_layer_sizes=(64, 32, 16), learning_rate_init=0.001, random_state=42, max_iter=10)
# Fit the model on the full training set
nn_model.fit(X_train_scaled, y_train)
# Calculate Permutation Importance
perm_importance = permutation_importance(nn_model, X_train_scaled, y_train, n_repeats=30, random_state=42)
# Create a DataFrame to store feature importance scores
nn_feature_importance_df = pd.DataFrame({
'Feature': X_train.columns,
'Importance': perm_importance.importances_mean
})
# Select the top 10 features based on Permutation Importance
nn_top_10_features = nn_feature_importance_df.sort_values(by='Importance', ascending=False).head(10)
# Print the top 10 features
print("Top 10 Features by Importance (Neural Network):")
print(nn_top_10_features)
# Visualize the top 10 features using a bar plot
plt.figure(figsize=(10, 6))
plt.barh(nn_top_10_features['Feature'], nn_top_10_features['Importance'], color='skyblue')
plt.xlabel("Importance Score")
plt.ylabel("Features")
plt.title("Top 10 Feature Importance (Neural Network - Permutation Importance)")
plt.gca().invert_yaxis() # Invert y-axis to show highest importance at the top
plt.show()
Top 10 Features by Importance (Neural Network):
Feature Importance
2 runtimeMinutes 0.145602
15 genre1 0.080312
20 titleType_encoded 0.065549
18 director1 0.028232
16 genre2 0.026180
13 self 0.025383
4 actress 0.020671
9 director 0.019552
3 numVotes 0.014986
1 startYear 0.009675
# select top features
nn_best_features = nn_top_10_features['Feature'][:]
X_train_nn = X_train_scaled[nn_best_features]
X_test_nn = X_test_scaled[nn_best_features]
# Building the NN model
nn_model = Sequential()
nn_model.add(Dense(64, input_dim=X_train_nn.shape[1], activation='relu'))
nn_model.add(Dense(32, activation='relu'))
nn_model.add(Dense(16, activation='relu'))
nn_model.add(Dense(1, activation='linear'))
# Compile Model
nn_model.compile(optimizer=Adam(learning_rate=0.001), loss='mean_squared_error')
nn_model.fit(X_train_nn, y_train, validation_split=0.1, epochs=10, batch_size=32, verbose=1)
# Making predictions on the train and test set
nn_y_train_pred = nn_model.predict(X_train_nn)
nn_y_test_pred = nn_model.predict(X_test_nn)
# Evaluating the model
nn_rmse_train = np.sqrt(mean_squared_error(y_train, nn_y_train_pred))
nn_rmse_test = np.sqrt(mean_squared_error(y_test, nn_y_test_pred))
nn_r2_train = r2_score(y_train, nn_y_train_pred)
nn_r2_test = r2_score(y_test, nn_y_test_pred)
print(f'Mean Squared Error -- Train: {nn_rmse_train} | Test: {nn_rmse_test}')
print(f'R-squared Score -- Train: {nn_r2_train} | Test: {nn_r2_test}')
Epoch 1/10 5688/5688 [==============================] - 7s 1ms/step - loss: 2.6734 - val_loss: 2.3032 Epoch 2/10 5688/5688 [==============================] - 7s 1ms/step - loss: 1.9360 - val_loss: 1.8856 Epoch 3/10 5688/5688 [==============================] - 7s 1ms/step - loss: 1.8725 - val_loss: 1.9092 Epoch 4/10 5688/5688 [==============================] - 7s 1ms/step - loss: 1.8493 - val_loss: 1.8968 Epoch 5/10 5688/5688 [==============================] - 7s 1ms/step - loss: 1.8324 - val_loss: 1.8289 Epoch 6/10 5688/5688 [==============================] - 7s 1ms/step - loss: 1.8204 - val_loss: 1.8317 Epoch 7/10 5688/5688 [==============================] - 7s 1ms/step - loss: 1.8141 - val_loss: 1.8318 Epoch 8/10 5688/5688 [==============================] - 7s 1ms/step - loss: 1.8068 - val_loss: 1.8100 Epoch 9/10 5688/5688 [==============================] - 7s 1ms/step - loss: 1.8022 - val_loss: 1.8897 Epoch 10/10 5688/5688 [==============================] - 7s 1ms/step - loss: 1.7984 - val_loss: 1.8054 6320/6320 [==============================] - 5s 737us/step 1580/1580 [==============================] - 1s 724us/step Mean Squared Error -- Train: 1.332724141109076 | Test: 1.3447710435702702 R-squared Score -- Train: 0.22866824052349222 | Test: 0.21775651778042526
# Putting the dataset together
predictions_df = X_test.copy()
predictions_df['averageRating'] = y_test
predictions_df['GB_preds'] = gb_y_test_pred
predictions_df['LGBM_preds'] = lgbm_y_test_pred
predictions_df['XGB_preds'] = xgb_y_test_pred
predictions_df['CAT_preds'] = cat_y_test_pred
predictions_df['NN_preds'] = nn_y_test_pred
# Display
predictions_df.head()
| isAdult | startYear | runtimeMinutes | numVotes | actress | archive_footage | archive_sound | cinematographer | composer | director | editor | producer | production_designer | self | writer | genre1 | genre2 | genre3 | director1 | director2 | titleType_encoded | averageRating | GB_preds | LGBM_preds | XGB_preds | CAT_preds | NN_preds | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 99618 | 0.0 | 2011 | 83.0 | 102967.0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 1 | 6 | 2 | 281508 | 0 | 1 | 6.1 | 5.727940 | 5.846384 | 6.026606 | 6.187697 | 6.410004 |
| 28895 | 0.0 | 2007 | 51.0 | 8.0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2152619 | 0 | 3 | 9.1 | 7.046470 | 7.000281 | 7.013858 | 6.927183 | 6.589914 |
| 187678 | 0.0 | 2014 | 9.0 | 5.0 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 6 | 11 | 0 | 7121989 | 0 | 2 | 9.6 | 7.354214 | 7.534873 | 7.639766 | 7.613340 | 6.978806 |
| 17756 | 0.0 | 2006 | 88.0 | 419.0 | 1 | 0 | 0 | 1 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 3 | 0 | 0 | 494069 | 0 | 1 | 6.5 | 5.446130 | 5.977076 | 5.907272 | 5.978857 | 6.215247 |
| 140193 | 0.0 | 2012 | 90.0 | 90.0 | 1 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 3 | 0 | 5188763 | 0 | 1 | 4.6 | 5.543962 | 5.639178 | 5.689574 | 5.755277 | 5.582937 |
# Create model performance table
model_perfor_df = pd.DataFrame({
"Model": ["GradientBoosting", "LightGBM", "XGBoost", "CatBoost", "NeuralNetwork"],
"R-Squared Train": [gb_r2_train, lgbm_r2_train, xgb_r2_train, cat_r2_train, nn_r2_train],
"R-Squared Test": [gb_r2_test, lgbm_r2_test, xgb_r2_test, cat_r2_test, nn_r2_test],
"RMSE Train": [gb_rmse_train, lgbm_rmse_train, xgb_rmse_train, cat_rmse_train, nn_rmse_train],
"RMSE Test": [gb_rmse_test, lgbm_rmse_test, xgb_rmse_test, cat_rmse_test, nn_rmse_test]
}).round(3)
# Sort by RMSE on test set
model_perfor_df = model_perfor_df.sort_values(by=['RMSE Test'], ascending=True).reset_index(drop=True)
model_perfor_df.head()
| Model | R-Squared Train | R-Squared Test | RMSE Train | RMSE Test | |
|---|---|---|---|---|---|
| 0 | CatBoost | 0.342 | 0.303 | 1.231 | 1.270 |
| 1 | XGBoost | 0.315 | 0.291 | 1.256 | 1.280 |
| 2 | LightGBM | 0.308 | 0.290 | 1.263 | 1.281 |
| 3 | GradientBoosting | 0.247 | 0.240 | 1.317 | 1.325 |
| 4 | NeuralNetwork | 0.229 | 0.218 | 1.333 | 1.345 |
All models demonstrated a relatively poor performance on both train and test sets, indicating limited predictive power. The CatBoost Regressor was the best model with an R-squared of 30.3% on the test set, slightly outperforming the others. This suggests it was able to generalize better, but it still explain only 30.3% of the variance in the tagert feature averageRating.
Given the low R-squared values, none of the models are suitable for production in their current form. To improve, the focus should be on:
Feature Engineering: Create new interaction features or non-linear transformations.
Data Enhancement: Include additional external data sources or more granular features.
Domain Knowledge: Collaborate with stakeholders to better understand the problem and identify relevant features that may not be captured in the current dataset.
Classification Problem: Transform the regression problem into a classification task by creating bins for averageRating. Predicting the bin each record falls into could increase model interpretability and potentially improve accuracy, making the predictions more actionable and easier to evaluate.
With these strategies, these models may have their performance significantly improved, making it more appropriate for production deployment.
You will make classification models here. Use Rating as the target variable. Make sure to exclude averageRating(Why?) from the predictor variables.
Select five advanced classification models beyond basic ones like Multinomial Logistic Regression, Decision Tree, Random Forest, K-Nearest Neighbors, and SVM. Determine appropriate test size and validation method, providing a clear rationale for your choices. Split your data into training and testing sets based on these decisions. Use 'Rating' as your target variable. Explain why 'averageRating' should be excluded. Preprocess your feature set as required by your selected models.
Implement each of your five chosen classification models using the prepared data. Train each model on the training set and generate predictions for the test set. Create a table showing the accuracy scores for all five models on both the training and testing data. For your best-performing model, create and display a colored heatmap of the confusion matrix for the test data predictions.
Select one additional metric beyond accuracy for validating your models, explaining why you chose this metric. Evaluate all five models using this additional metric and determine which model performs best based on it. Critically assess whether your best model is suitable for production use, discussing its strengths and limitations. Propose specific strategies or techniques you would employ to further improve the model's performance.
# Standards
import pandas as pd
pd.set_option('display.max_columns', None)
import numpy as np
import warnings
warnings.filterwarnings('ignore')
import matplotlib.pyplot as plt
import seaborn as sns
# Data prep for modeling
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.inspection import permutation_importance
# Modeling
from sklearn.ensemble import AdaBoostClassifier
from xgboost import XGBClassifier
from catboost import CatBoostClassifier
from lightgbm import LGBMClassifier
from sklearn.neural_network import MLPClassifier
# Evaluation
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score, make_scorer, f1_score
# Data location
data_loc = "/Users/gabrielvictorgomesferreira/Library/Mobile Documents/com~apple~CloudDocs/Work/ISU Classes/MIS 546 - Advanced Business Analytics/Data/"
file_name = "newimdb_clean.csv"
# Import dataset
clean_df = pd.read_csv(data_loc + file_name)
clean_df.head()
| isAdult | startYear | runtimeMinutes | averageRating | numVotes | actress | archive_footage | archive_sound | cinematographer | composer | director | editor | producer | production_designer | self | writer | genre1 | genre2 | genre3 | director1 | director2 | titleType_encoded | rating | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 2015 | 95.903562 | 8.1 | 63.0 | 1 | 0 | 0 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 1 | 1 | 2 | 3 | 13648797 | 0 | 1 | 5 |
| 1 | 0.0 | 2001 | 104.000000 | 2.4 | 23686.0 | 1 | 0 | 0 | 1 | 1 | 1 | 0 | 1 | 0 | 0 | 1 | 3 | 4 | 5 | 193554 | 0 | 1 | 1 |
| 2 | 0.0 | 2001 | 100.000000 | 4.9 | 1850.0 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 1 | 6 | 7 | 8 | 440948 | 0 | 1 | 1 |
| 3 | 0.0 | 2005 | 77.000000 | 7.3 | 279860.0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 9 | 3 | 10 | 318 | 425843 | 1 | 4 |
| 4 | 0.0 | 2005 | 10.000000 | 6.0 | 343.0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 11 | 0 | 0 | 473647 | 0 | 2 | 2 |
# Predictor Features
X = clean_df.drop(columns=['averageRating', 'rating'])
# Tagert Feature
y = clean_df['rating']
Here, we need to drop the averageRating column because it'd leak 100% of the informaion needed for the model predict rating. In a real scenario, we would not have this feature in an inference batch to make predictions on. Therefore, we must remove it.
# Split data into train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
I selected an 80/20 train-test split for the data, where 80% is used for training and 20% for testing. This provides a good balance, ensuring that the model has enough data to learn from while having also enough data for an unbiased evaluation of performance on unseen samples.
For model validation, I chose 5-Fold Cross-Validation on the training set. This method divides the training data into 5 subsets, where the model is trained on 4 subsets and validated on the remaining one. This process is repeated 5 times, in which each subset is treated as a validation set at least once. At the end, the average results across all folds helps on the assessment of the model’s generalization capability, which may reduce variance and mitigate the risk of overfitting.
Models to be considered:
Evaluation methods: For the following classification models, I am going to use Accuracy and F-1 Score as the evaluation methods. Accuracy indicates how often the model correctly predicts the labels, making it a straightforward metric for evaluating overall performance. F1-score combines precision (correct positive predictions) and recall (capturing all positives), ensuring the model performs well across all classes. Together, they provide a complete picture of prediction quality, covering both correctness and consistency in classification.
# Building the AdaBoostClassifier model
ada_feature_model = AdaBoostClassifier(n_estimators=100, random_state=42)
ada_feature_model.fit(X_train, y_train)
perm_importance = permutation_importance(ada_feature_model, X_train, y_train, n_repeats=30, random_state=42)
# Create a DataFrame to store feature importances
ada_feature_importance_df = pd.DataFrame({
'Feature': X_train.columns,
'Importance': perm_importance.importances_mean
})
# Sort by importance and select top 10
ada_top_10_features = ada_feature_importance_df.sort_values(by='Importance', ascending=False).reset_index(drop=True).head(10)
# Print the top 10 feature names and their importance scores
print("Top 10 Features by Importance (AdaBoostClassifier):")
print(ada_top_10_features)
# Visualize the top 10 features
plt.figure(figsize=(10, 6))
plt.barh(ada_top_10_features['Feature'], ada_top_10_features['Importance'], color='skyblue')
plt.xlabel("Gain (Importance Score)")
plt.ylabel("Features")
plt.title("Top 10 Feature Importance (AdaBoostClassifier)")
plt.gca().invert_yaxis()
plt.show()
Top 10 Features by Importance (AdaBoostClassifier):
Feature Importance
0 genre1 0.035882
1 numVotes 0.035691
2 runtimeMinutes 0.035285
3 titleType_encoded 0.008900
4 director1 0.007375
5 genre2 0.004172
6 startYear 0.002592
7 producer 0.002353
8 actress 0.001954
9 editor 0.000884
# select top features
ada_best_features = ada_top_10_features['Feature'][:]
X_train_ada = X_train[ada_best_features]
X_test_ada = X_test[ada_best_features]
# Building the AdaBoostClassifier model
ada_model = AdaBoostClassifier(n_estimators=100, random_state=42)
ada_model.fit(X_train_ada, y_train)
# Define a custom scorer for F1 score (weighted to account for class imbalance)
f1_scorer = make_scorer(f1_score, average='weighted')
# Perform 5-fold cross-validation using accuracy and F1-score
ada_cv_accuracy_scores = cross_val_score(ada_model, X_train_ada, y_train, cv=5, scoring='accuracy')
ada_cv_f1_scores = cross_val_score(ada_model, X_train_ada, y_train, cv=5, scoring=f1_scorer)
print(f"\nCross-Validation Scores (5-Fold) for AdaBoost")
print("=============================================")
print(f"Accuracy: {ada_cv_accuracy_scores.mean():.3f} ± {ada_cv_accuracy_scores.std():.3f}")
print(f"F1-Score: {ada_cv_f1_scores.mean():.3f} ± {ada_cv_f1_scores.std():.3f}")
Cross-Validation Scores (5-Fold) for AdaBoost ============================================= Accuracy: 0.353 ± 0.002 F1-Score: 0.322 ± 0.002
# Making predictions on the train and test set
ada_y_train_pred = ada_model.predict(X_train_ada)
ada_y_test_pred = ada_model.predict(X_test_ada)
# Evaluate the model's performance
print("\nFinal Model Evaluation for AdaBoost Classifier")
print("=============================================")
# Training Set Evaluation
ada_train_accuracy = accuracy_score(y_train, ada_y_train_pred)
ada_train_f1_score = f1_score(y_train, ada_y_train_pred, average='weighted')
print("Training Set Performance")
print(f"Accuracy: {ada_train_accuracy:.3f}")
print(f"F1-Score: {ada_train_f1_score:.3f}")
# print(f"Confusion Matrix:\n{confusion_matrix(y_train, ada_y_train_pred)}")
# print(f"Classification Report:\n{classification_report(y_train, ada_y_train_pred)}")
# Testing Set Evaluation
ada_test_accuracy = accuracy_score(y_test, ada_y_test_pred)
ada_test_f1_score = f1_score(y_test, ada_y_test_pred, average='weighted')
print("\nTesting Set Performance")
print(f"Accuracy: {ada_test_accuracy:.3f}")
print(f"F1-Score: {ada_test_f1_score:.3f}")
# print(f"Confusion Matrix:\n{confusion_matrix(y_test, ada_y_test_pred)}")
# print(f"Classification Report:\n{classification_report(y_test, ada_y_test_pred)}")
Final Model Evaluation for AdaBoost Classifier ============================================= Training Set Performance Accuracy: 0.354 F1-Score: 0.326 Testing Set Performance Accuracy: 0.351 F1-Score: 0.323
# Feature Importance using Permutation Importance
xgb_feature_model = XGBClassifier(n_estimators=100, random_state=42, use_label_encoder=False, eval_metric='mlogloss')
xgb_feature_model.fit(X_train, y_train-1)
# Calculate permutation importance
perm_importance = permutation_importance(xgb_feature_model, X_train, y_train-1, n_repeats=30, random_state=42)
# Create a DataFrame to store feature importances
xgb_feature_importance_df = pd.DataFrame({
'Feature': X_train.columns,
'Importance': perm_importance.importances_mean
})
# Sort by importance and select top 10
xgb_top_10_features = xgb_feature_importance_df.sort_values(by='Importance', ascending=False).reset_index(drop=True).head(10)
# Print the top 10 feature names and their importance scores
print("Top 10 Features by Importance (XGBoost Classifier):")
print(xgb_top_10_features)
# Visualize the top 10 features
plt.figure(figsize=(10, 6))
plt.barh(xgb_top_10_features['Feature'], xgb_top_10_features['Importance'], color='skyblue')
plt.xlabel("Gain (Importance Score)")
plt.ylabel("Features")
plt.title("Top 10 Feature Importance (XGBoost Classifier)")
plt.gca().invert_yaxis()
plt.show()
Top 10 Features by Importance (XGBoost Classifier):
Feature Importance
0 numVotes 0.079104
1 runtimeMinutes 0.071120
2 genre1 0.068923
3 titleType_encoded 0.035874
4 director1 0.033853
5 startYear 0.026643
6 genre2 0.025524
7 genre3 0.012835
8 director2 0.012210
9 producer 0.005652
# Select top features
xgb_best_features = xgb_top_10_features['Feature'][:]
X_train_xgb = X_train[xgb_best_features]
X_test_xgb = X_test[xgb_best_features]
# Building the XGBoostClassifier model
xgb_model = XGBClassifier(n_estimators=100, random_state=42, use_label_encoder=False, eval_metric='mlogloss')
xgb_model.fit(X_train_xgb, y_train-1)
# Define a custom scorer for F1 score (weighted to account for class imbalance)
f1_scorer = make_scorer(f1_score, average='weighted')
# Perform 5-fold cross-validation using accuracy and F1-score
xgb_cv_accuracy_scores = cross_val_score(xgb_model, X_train_xgb, y_train-1, cv=5, scoring='accuracy')
xgb_cv_f1_scores = cross_val_score(xgb_model, X_train_xgb, y_train-1, cv=5, scoring=f1_scorer)
print(f"\nCross-Validation Scores (5-Fold) for XGBoost")
print("=============================================")
print(f"Accuracy: {xgb_cv_accuracy_scores.mean():.3f} ± {xgb_cv_accuracy_scores.std():.3f}")
print(f"F1-Score: {xgb_cv_f1_scores.mean():.3f} ± {xgb_cv_f1_scores.std():.3f}")
Cross-Validation Scores (5-Fold) for XGBoost ============================================= Accuracy: 0.382 ± 0.002 F1-Score: 0.365 ± 0.002
# Making predictions on the train and test set
xgb_y_train_pred = xgb_model.predict(X_train_xgb)+1
xgb_y_test_pred = xgb_model.predict(X_test_xgb)+1
# Evaluate the model's performance
print("\nFinal Model Evaluation for XGBoost Classifier")
print("=============================================")
# Training Set Evaluation
xgb_train_accuracy = accuracy_score(y_train, xgb_y_train_pred)
xgb_train_f1_score = f1_score(y_train, xgb_y_train_pred, average='weighted')
print("Training Set Performance")
print(f"Accuracy: {xgb_train_accuracy:.3f}")
print(f"F1-Score: {xgb_train_f1_score:.3f}")
# print(f"Confusion Matrix:\n{confusion_matrix(y_train, xgb_y_train_pred)}")
# print(f"Classification Report:\n{classification_report(y_train, xgb_y_train_pred)}")
# Testing Set Evaluation
xgb_test_accuracy = accuracy_score(y_test, xgb_y_test_pred)
xgb_test_f1_score = f1_score(y_test, xgb_y_test_pred, average='weighted')
print("\nTesting Set Performance")
print(f"Accuracy: {xgb_train_accuracy:.3f}")
print(f"F1-Score: {xgb_train_f1_score:.3f}")
# print(f"Confusion Matrix:\n{confusion_matrix(y_test, xgb_y_test_pred)}")
# print(f"Classification Report:\n{classification_report(y_test, xgb_y_test_pred)}")
Final Model Evaluation for XGBoost Classifier ============================================= Training Set Performance Accuracy: 0.442 F1-Score: 0.428 Testing Set Performance Accuracy: 0.442 F1-Score: 0.428
# Building the CatBoostClassifier model
cat_feature_model = CatBoostClassifier(n_estimators=100, random_state=42, verbose=0)
cat_feature_model.fit(X_train, y_train)
# Calculate permutation importance
perm_importance = permutation_importance(cat_feature_model, X_train, y_train, n_repeats=30, random_state=42)
# Create a DataFrame to store feature importances
cat_feature_importance_df = pd.DataFrame({
'Feature': X_train.columns,
'Importance': perm_importance.importances_mean
})
# Sort by importance and select top 10
cat_top_10_features = cat_feature_importance_df.sort_values(by='Importance', ascending=False).reset_index(drop=True).head(10)
# Print the top 10 feature names and their importance scores
print("Top 10 Features by Importance (CatBoost Classifier):")
print(cat_top_10_features)
# Visualize the top 10 features
plt.figure(figsize=(10, 6))
plt.barh(cat_top_10_features['Feature'], cat_top_10_features['Importance'], color='skyblue')
plt.xlabel("Gain (Importance Score)")
plt.ylabel("Features")
plt.title("Top 10 Feature Importance (CatBoost Classifier)")
plt.gca().invert_yaxis()
plt.show()
Top 10 Features by Importance (CatBoost Classifier):
Feature Importance
0 numVotes 0.064452
1 runtimeMinutes 0.056290
2 genre1 0.054841
3 titleType_encoded 0.033236
4 director1 0.017670
5 genre2 0.016108
6 startYear 0.015642
7 self 0.007302
8 genre3 0.006003
9 director2 0.004298
# Select top features
cat_best_features = cat_top_10_features['Feature'][:]
X_train_cat = X_train[cat_best_features]
X_test_cat = X_test[cat_best_features]
# Building the CatBoostClassifier model
cat_model = CatBoostClassifier(n_estimators=100, random_state=42, verbose=0)
cat_model.fit(X_train_cat, y_train)
# Define a custom scorer for F1 score (weighted to account for class imbalance)
f1_scorer = make_scorer(f1_score, average='weighted')
# Perform 5-fold cross-validation using accuracy and F1-score
cat_cv_accuracy_scores = cross_val_score(cat_model, X_train_cat, y_train, cv=5, scoring='accuracy')
cat_cv_f1_scores = cross_val_score(cat_model, X_train_cat, y_train, cv=5, scoring=f1_scorer)
print(f"\nCross-Validation Scores (5-Fold) for CatBoost")
print("=============================================")
print(f"Accuracy: {cat_cv_accuracy_scores.mean():.3f} ± {cat_cv_accuracy_scores.std():.3f}")
print(f"F1-Score: {cat_cv_f1_scores.mean():.3f} ± {cat_cv_f1_scores.std():.3f}")
Cross-Validation Scores (5-Fold) for CatBoost ============================================= Accuracy: 0.378 ± 0.002 F1-Score: 0.361 ± 0.002
# Making predictions on the train and test set
cat_y_train_pred = cat_model.predict(X_train_cat)
cat_y_test_pred = cat_model.predict(X_test_cat)
# Evaluate the model's performance
print("\nFinal Model Evaluation for CatBoost Classifier")
print("=============================================")
# Training Set Evaluation
cat_train_accuracy = accuracy_score(y_train, cat_y_train_pred)
cat_train_f1_score = f1_score(y_train, cat_y_train_pred, average='weighted')
print("Training Set Performance")
print(f"Accuracy: {cat_train_accuracy:.3f}")
print(f"F1-Score: {cat_train_f1_score:.3f}")
# print(f"Confusion Matrix:\n{confusion_matrix(y_train, cat_y_train_pred)}")
# print(f"Classification Report:\n{classification_report(y_train, cat_y_train_pred)}")
# Testing Set Evaluation
cat_test_accuracy = accuracy_score(y_test, cat_y_test_pred)
cat_test_f1_score = f1_score(y_test, cat_y_test_pred, average='weighted')
print("\nTesting Set Performance")
print(f"Accuracy: {cat_test_accuracy:.3f}")
print(f"F1-Score: {cat_test_f1_score:.3f}")
# print(f"Confusion Matrix:\n{confusion_matrix(y_test, cat_y_test_pred)}")
# print(f"Classification Report:\n{classification_report(y_test, cat_y_test_pred)}")
Final Model Evaluation for CatBoost Classifier ============================================= Training Set Performance Accuracy: 0.400 F1-Score: 0.383 Testing Set Performance Accuracy: 0.378 F1-Score: 0.359
# Building the LightGBMClassifier model
lgbm_feature_model = LGBMClassifier(n_estimators=100, random_state=42, verbose=0)
lgbm_feature_model.fit(X_train, y_train)
# Calculate permutation importance
perm_importance = permutation_importance(lgbm_feature_model, X_train, y_train, n_repeats=30, random_state=42)
# Create a DataFrame to store feature importances
lgbm_feature_importance_df = pd.DataFrame({
'Feature': X_train.columns,
'Importance': perm_importance.importances_mean
})
# Sort by importance and select top 10
lgbm_top_10_features = lgbm_feature_importance_df.sort_values(by='Importance', ascending=False).reset_index(drop=True).head(10)
# Print the top 10 feature names and their importance scores
print("Top 10 Features by Importance (LightGBM Classifier):")
print(lgbm_top_10_features)
# Visualize the top 10 features
plt.figure(figsize=(10, 6))
plt.barh(lgbm_top_10_features['Feature'], lgbm_top_10_features['Importance'], color='skyblue')
plt.xlabel("Gain (Importance Score)")
plt.ylabel("Features")
plt.title("Top 10 Feature Importance (LightGBM Classifier)")
plt.gca().invert_yaxis()
plt.show()
Top 10 Features by Importance (LightGBM Classifier):
Feature Importance
0 numVotes 0.062097
1 runtimeMinutes 0.059598
2 genre1 0.057071
3 titleType_encoded 0.027865
4 director1 0.015476
5 genre2 0.014882
6 startYear 0.012541
7 genre3 0.006184
8 producer 0.004229
9 director2 0.004000
# Select top features
lgbm_best_features = lgbm_top_10_features['Feature'][:]
X_train_lgbm = X_train[lgbm_best_features]
X_test_lgbm = X_test[lgbm_best_features]
# Building the LightGBMClassifier model
lgbm_model = LGBMClassifier(n_estimators=100, random_state=42)
lgbm_model.fit(X_train_lgbm, y_train)
# Define a custom scorer for F1 score (weighted to account for class imbalance)
f1_scorer = make_scorer(f1_score, average='weighted')
# Perform 5-fold cross-validation using accuracy and F1-score
lgbm_cv_accuracy_scores = cross_val_score(lgbm_model, X_train_lgbm, y_train, cv=5, scoring='accuracy')
lgbm_cv_f1_scores = cross_val_score(lgbm_model, X_train_lgbm, y_train, cv=5, scoring=f1_scorer)
print(f"\nCross-Validation Scores (5-Fold) for LightGBM")
print("=============================================")
print(f"Accuracy: {lgbm_cv_accuracy_scores.mean():.3f} ± {lgbm_cv_accuracy_scores.std():.3f}")
print(f"F1-Score: {lgbm_cv_f1_scores.mean():.3f} ± {lgbm_cv_f1_scores.std():.3f}")
Cross-Validation Scores (5-Fold) for LightGBM ============================================= Accuracy: 0.381 ± 0.002 F1-Score: 0.361 ± 0.002
# Making predictions on the train and test set
lgbm_y_train_pred = lgbm_model.predict(X_train_lgbm)
lgbm_y_test_pred = lgbm_model.predict(X_test_lgbm)
# Evaluate the model's performance
print("\nFinal Model Evaluation for LightGBM Classifier")
print("=============================================")
# Training Set Performance
lgbm_train_accuracy = accuracy_score(y_train, lgbm_y_train_pred)
lgbm_train_f1_score = f1_score(y_train, lgbm_y_train_pred, average='weighted')
print("Training Set Performance")
print(f"Accuracy: {lgbm_train_accuracy:.3f}")
print(f"F1-Score: {lgbm_train_f1_score:.3f}")
# print(f"Confusion Matrix:\n{confusion_matrix(y_train, lgbm_y_train_pred)}")
# print(f"Classification Report:\n{classification_report(y_train, lgbm_y_train_pred)}")
# Testing Set Performance
lgbm_test_accuracy = accuracy_score(y_test, lgbm_y_test_pred)
lgbm_test_f1_score = f1_score(y_test, lgbm_y_test_pred, average='weighted')
print("\nTesting Set Performance")
print(f"Accuracy: {lgbm_test_accuracy:.3f}")
print(f"F1-Score: {lgbm_test_f1_score:.3f}")
# print(f"Confusion Matrix:\n{confusion_matrix(y_test, lgbm_y_test_pred)}")
# print(f"Classification Report:\n{classification_report(y_test, lgbm_y_test_pred)}")
Final Model Evaluation for LightGBM Classifier ============================================= Training Set Performance Accuracy: 0.408 F1-Score: 0.390 Testing Set Performance Accuracy: 0.380 F1-Score: 0.359
# Building the Neural Network model
nn_feature_model = MLPClassifier(hidden_layer_sizes=(64, 32, 16), max_iter=1000, random_state=42)
nn_feature_model.fit(X_train, y_train)
# Calculate permutation importance
perm_importance = permutation_importance(nn_feature_model, X_train, y_train, n_repeats=30, random_state=42)
# Create a DataFrame to store feature importances
nn_feature_importance_df = pd.DataFrame({
'Feature': X_train.columns,
'Importance': perm_importance.importances_mean
})
# Sort by importance and select top 10
nn_top_10_features = nn_feature_importance_df.sort_values(by='Importance', ascending=False).reset_index(drop=True).head(10)
# Print the top 10 feature names and their importance scores
print("Top 10 Features by Importance (Neural Network Classifier):")
print(nn_top_10_features)
# Visualize the top 10 features
plt.figure(figsize=(10, 6))
plt.barh(nn_top_10_features['Feature'], nn_top_10_features['Importance'], color='skyblue')
plt.xlabel("Importance Score")
plt.ylabel("Features")
plt.title("Top 10 Feature Importance (Neural Network Classifier)")
plt.gca().invert_yaxis()
plt.show()
Top 10 Features by Importance (Neural Network Classifier):
Feature Importance
0 isAdult 0.0
1 producer 0.0
2 director2 0.0
3 director1 0.0
4 genre3 0.0
5 genre2 0.0
6 genre1 0.0
7 writer 0.0
8 self 0.0
9 production_designer 0.0
# Select top features
nn_best_features = nn_top_10_features['Feature'][:]
X_train_nn = X_train[nn_best_features]
X_test_nn = X_test[nn_best_features]
# Building the Neural Network model
nn_model = MLPClassifier(hidden_layer_sizes=(64, 32, 16), max_iter=1000, random_state=42)
nn_model.fit(X_train_nn, y_train)
# Define a custom scorer for F1 score (weighted to account for class imbalance)
f1_scorer = make_scorer(f1_score, average='weighted')
# Perform 5-fold cross-validation using accuracy and F1-score
nn_cv_accuracy_scores = cross_val_score(nn_model, X_train_nn, y_train, cv=5, scoring='accuracy')
nn_cv_f1_scores = cross_val_score(nn_model, X_train_nn, y_train, cv=5, scoring=f1_scorer)
print(f"\nCross-Validation Scores (5-Fold) for Neural Network")
print("=============================================")
print(f"Accuracy: {nn_cv_accuracy_scores.mean():.3f} ± {nn_cv_accuracy_scores.std():.3f}")
print(f"F1-Score: {nn_cv_f1_scores.mean():.3f} ± {nn_cv_f1_scores.std():.3f}")
Cross-Validation Scores (5-Fold) for Neural Network ============================================= Accuracy: 0.216 ± 0.004 F1-Score: 0.093 ± 0.014
# Making predictions on the train and test set
nn_y_train_pred = nn_model.predict(X_train_nn)
nn_y_test_pred = nn_model.predict(X_test_nn)
# Evaluate the model's performance
print("\nFinal Model Evaluation for Neural Network Classifier")
print("=============================================")
# Training Set Performance
nn_train_accuracy = accuracy_score(y_train, nn_y_train_pred)
nn_train_f1_score = f1_score(y_train, nn_y_train_pred, average='weighted')
print("Training Set Performance")
print(f"Accuracy: {nn_train_accuracy:.3f}")
print(f"F1-Score: {nn_train_f1_score:.3f}")
# print(f"Confusion Matrix:\n{confusion_matrix(y_train, nn_y_train_pred)}")
# print(f"Classification Report:\n{classification_report(y_train, nn_y_train_pred)}")
# Testing Set Performance
nn_test_accuracy = accuracy_score(y_test, nn_y_test_pred)
nn_test_f1_score = f1_score(y_test, nn_y_test_pred, average='weighted')
print("\nTesting Set Performance")
print(f"Accuracy: {nn_test_accuracy:.3f}")
print(f"F1-Score: {nn_test_f1_score:.3f}")
# print(f"Confusion Matrix:\n{confusion_matrix(y_test, nn_y_test_pred)}")
# print(f"Classification Report:\n{classification_report(y_test, nn_y_test_pred)}")
Final Model Evaluation for Neural Network Classifier ============================================= Training Set Performance Accuracy: 0.215 F1-Score: 0.086 Testing Set Performance Accuracy: 0.215 F1-Score: 0.085
# Putting the dataset together
class_predictions_df = X_test.copy()
class_predictions_df['rating'] = y_test
class_predictions_df['ADA_preds'] = ada_y_test_pred
class_predictions_df['XGB_preds'] = xgb_y_test_pred
class_predictions_df['CAT_preds'] = cat_y_test_pred
class_predictions_df['LGBM_preds'] = lgbm_y_test_pred
class_predictions_df['NN_preds'] = nn_y_test_pred
# Display
class_predictions_df.head()
| isAdult | startYear | runtimeMinutes | numVotes | actress | archive_footage | archive_sound | cinematographer | composer | director | editor | producer | production_designer | self | writer | genre1 | genre2 | genre3 | director1 | director2 | titleType_encoded | rating | ADA_preds | XGB_preds | CAT_preds | LGBM_preds | NN_preds | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 99618 | 0.0 | 2011 | 83.0 | 102967.0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 1 | 6 | 2 | 281508 | 0 | 1 | 2 | 2 | 2 | 2 | 2 | 1 |
| 28895 | 0.0 | 2007 | 51.0 | 8.0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2152619 | 0 | 3 | 5 | 5 | 5 | 5 | 5 | 1 |
| 187678 | 0.0 | 2014 | 9.0 | 5.0 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 6 | 11 | 0 | 7121989 | 0 | 2 | 5 | 5 | 5 | 5 | 5 | 1 |
| 17756 | 0.0 | 2006 | 88.0 | 419.0 | 1 | 0 | 0 | 1 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 3 | 0 | 0 | 494069 | 0 | 1 | 3 | 1 | 2 | 2 | 2 | 1 |
| 140193 | 0.0 | 2012 | 90.0 | 90.0 | 1 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 3 | 0 | 5188763 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
# Create model performance table
class_model_perfor_df = pd.DataFrame({
"Model": ["AdaBoost", "XGBoost", "CatBoost", "LightGBM", "NeuralNetwork"],
"Accuracy Train": [ada_train_accuracy, xgb_train_accuracy, cat_train_accuracy, lgbm_train_accuracy, nn_train_accuracy],
"Accuracy Test": [ada_test_accuracy, xgb_test_accuracy, cat_test_accuracy, lgbm_test_accuracy, nn_test_accuracy],
"F1-Score Train": [ada_train_f1_score, xgb_train_f1_score, cat_train_f1_score, lgbm_train_f1_score, nn_train_f1_score],
"F1-Score Test": [ada_test_f1_score, xgb_test_f1_score, cat_test_f1_score, lgbm_test_f1_score, nn_test_f1_score]
}).round(3)
# Sort by Accuracy on test set
class_model_perfor_df = class_model_perfor_df.sort_values(by=['Accuracy Test'], ascending=False).reset_index(drop=True)
class_model_perfor_df.head()
| Model | Accuracy Train | Accuracy Test | F1-Score Train | F1-Score Test | |
|---|---|---|---|---|---|
| 0 | XGBoost | 0.442 | 0.381 | 0.428 | 0.364 |
| 1 | LightGBM | 0.408 | 0.380 | 0.390 | 0.359 |
| 2 | CatBoost | 0.400 | 0.378 | 0.383 | 0.359 |
| 3 | AdaBoost | 0.354 | 0.351 | 0.326 | 0.323 |
| 4 | NeuralNetwork | 0.215 | 0.215 | 0.086 | 0.085 |
All models demonstrated a relatively poor performance on both train and test sets, indicating limited predictive power. The XGBoost was the best model with an accuracy of 38.1% on the test set, slightly outperforming the others. This suggests it was able to generalize better, but it still correctly classified the labels in the target feature rating only 38.1% of the time.
Given the low accuracy values, none of the models are suitable for production in their current form. To improve, the focus should be on:
Feature Engineering: Create new interaction features or non-linear transformations.
Data Enhancement: Include additional external data sources or more granular features.
Domain Knowledge: Collaborate with stakeholders to better understand the problem and identify relevant features that may not be captured in the current dataset.
These strategies are what we have left to attempt in order to improve model performance and have a more appropriate model for production deployment.
# XGBoost Confusion Matrix for the test dataset
conf_matrix = confusion_matrix(y_test, xgb_y_test_pred)
# Step 8: Plot the confusion matrix as a heatmap
plt.figure(figsize=(8, 6))
sns.heatmap(conf_matrix, annot=True, fmt='d', cmap='Blues')
plt.title('Confusion Matrix')
plt.ylabel('Actual')
plt.xlabel('Predicted')
plt.tight_layout()
plt.show()
For this problem, you'll create clustering models using techniques other than K-Means clustering. Here's the clustering problem structured into three parts:
Select relevant columns from the cleaned data for clustering, providing a brief explanation for your choices. Determine the appropriate number of clusters (K) you want to use and justify your reasoning. This step should involve careful consideration of the dataset's characteristics and the potential insights you aim to gain from the clustering process.
Implement two clustering models using techniques other than K-Means clustering with the chosen features. Predict cluster assignments for data points using each model. Add these cluster labels to the original dataset from problem 1, creating new columns to represent the cluster assignments from each model.
For each of the two clustering models, identify two new pieces of information or insights about the dataset using the added cluster labels. Analyze how the clusters relate to other variables in the dataset and what patterns or groupings they reveal. Explain your process for deriving these insights and discuss their potential implications for understanding the IMDb dataset. Consider how these insights might be valuable for further analysis or decision-making in the context of movie ratings and characteristics.
# Standards
import pandas as pd
pd.set_option('display.max_columns', None)
import numpy as np
import warnings
warnings.filterwarnings('ignore')
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.cluster import AgglomerativeClustering
from sklearn.cluster import DBSCAN
from sklearn.neighbors import NearestNeighbors
from scipy.cluster.hierarchy import dendrogram, linkage
from sklearn.metrics import silhouette_score
# Data location
data_loc = "/Users/gabrielvictorgomesferreira/Library/Mobile Documents/com~apple~CloudDocs/Work/ISU Classes/MIS 546 - Advanced Business Analytics/Data/"
file_name = "newimdb_clean.csv"
# Import dataset
clean_df = pd.read_csv(data_loc + file_name)
print(clean_df.shape)
clean_df.head()
(252763, 23)
| isAdult | startYear | runtimeMinutes | averageRating | numVotes | actress | archive_footage | archive_sound | cinematographer | composer | director | editor | producer | production_designer | self | writer | genre1 | genre2 | genre3 | director1 | director2 | titleType_encoded | rating | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 2015 | 95.903562 | 8.1 | 63.0 | 1 | 0 | 0 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 1 | 1 | 2 | 3 | 13648797 | 0 | 1 | 5 |
| 1 | 0.0 | 2001 | 104.000000 | 2.4 | 23686.0 | 1 | 0 | 0 | 1 | 1 | 1 | 0 | 1 | 0 | 0 | 1 | 3 | 4 | 5 | 193554 | 0 | 1 | 1 |
| 2 | 0.0 | 2001 | 100.000000 | 4.9 | 1850.0 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 1 | 6 | 7 | 8 | 440948 | 0 | 1 | 1 |
| 3 | 0.0 | 2005 | 77.000000 | 7.3 | 279860.0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 9 | 3 | 10 | 318 | 425843 | 1 | 4 |
| 4 | 0.0 | 2005 | 10.000000 | 6.0 | 343.0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 11 | 0 | 0 | 473647 | 0 | 2 | 2 |
Feature Selection: I'm going to use what I've observed from the feature importance bar-plots from the models above to select the top 5 features to work on the clustering problem.
Models to be considered
Number of Clusters: I'm going to use Elbow Method to help me to identify an optimal number of clusters.
# Defining observed relevant columns
relevant_features = ['genre1', 'director1', 'runtimeMinutes', 'titleType_encoded', 'genre2', 'averageRating']
# Selecting observed relevant columns
cluster_df = clean_df[relevant_features]
# Scale the dataset before clustering
scaler = StandardScaler()
cluster_scaled_df = scaler.fit_transform(cluster_df)
# Convert the scaled array back to a DataFrame with original column names
cluster_scaled_df = pd.DataFrame(cluster_scaled_df, columns=cluster_df.columns)
print(cluster_scaled_df.shape)
cluster_scaled_df.head()
(252763, 6)
| genre1 | director1 | runtimeMinutes | titleType_encoded | genre2 | averageRating | |
|---|---|---|---|---|---|---|
| 0 | -1.108926 | 3.385248 | 0.140271 | -0.790831 | -0.570000 | 0.953550 |
| 1 | -0.786646 | -0.990728 | 0.188049 | -0.790831 | -0.242822 | -2.801185 |
| 2 | -0.303226 | -0.910269 | 0.164444 | -0.790831 | 0.247946 | -1.154371 |
| 3 | 0.180195 | -1.053573 | 0.028719 | -0.790831 | -0.406411 | 0.426570 |
| 4 | 0.502475 | -0.899634 | -0.366655 | -0.295527 | -0.897178 | -0.429773 |
# Randomly sample a subset of the data
cluster_sampled = cluster_scaled_df.sample(n=25000, random_state=42)
cluster_sampled.shape
(25000, 6)
silhouette_scores = []
k_values = range(2, 10) # Trying different cluster numbers from 2 to 10
for k in k_values:
print(k)
# Fit Agglomerative Clustering model for each k value
agg_clustering = AgglomerativeClustering(n_clusters=k)
cluster_labels = agg_clustering.fit_predict(cluster_sampled)
# Calculate silhouette score
score = silhouette_score(cluster_sampled, cluster_labels)
silhouette_scores.append(score)
# Plot the Silhouette Score Method
plt.figure(figsize=(8, 6))
plt.plot(k_values, silhouette_scores, 'bo-', markersize=8)
plt.xlabel('Number of clusters (k)')
plt.ylabel('Silhouette Score')
plt.title('Silhouette Score Method to Determine Optimal Number of Clusters')
plt.grid(True)
plt.show()
2 3 4 5 6 7 8 9
The highest silhouette score is at k = 9, suggesting that 9 clusters might be the most optimal choice for the dataset. The increasing trend in the silhouette score means that adding more clusters helps on the separation of the clusters. However, the overall silhouette scores are relatively low (all below 0.3), indicating that even with 9 clusters, the clustering structure is not very strong. This could mean that the data points are not distinctly grouped, or that additional feature engineering might be necessary to achieve better clustering performance.
# Build AgglomerativeClustering model
agg_model = AgglomerativeClustering(n_clusters=9, affinity='euclidean', linkage='ward')
cluster_sampled['cluster_agglomerative'] = agg_model.fit_predict(cluster_sampled)
# Scatter plot of two relevant features with cluster labels color-coded
plt.figure(figsize=(10, 6))
sns.scatterplot(x='runtimeMinutes', y='averageRating', hue='cluster_agglomerative', data=cluster_sampled, palette='tab20')
plt.title('DBSCAN Clusters Visualization')
plt.show()
# Scatter plot of two relevant features with cluster labels color-coded
plt.figure(figsize=(10, 6))
sns.scatterplot(x='genre1', y='averageRating', hue='cluster_agglomerative', data=cluster_sampled, palette='tab20')
plt.title('DBSCAN Clusters Visualization')
plt.show()
# Scatter plot of two relevant features with cluster labels color-coded
plt.figure(figsize=(10, 6))
sns.scatterplot(x='director1', y='averageRating', hue='cluster_agglomerative', data=cluster_sampled, palette='tab20')
plt.title('DBSCAN Clusters Visualization')
plt.show()
# Calculate the linkage matrix
linked = linkage(cluster_sampled, method='ward')
# Plot the dendrogram
plt.figure(figsize=(10, 7))
dendrogram(linked, truncate_mode='lastp', p=30)
plt.title('Dendrogram for Agglomerative Clustering')
plt.xlabel('Number of points in node')
plt.ylabel('Distance')
plt.show()
The dendrogram shows the hierarchical structure of clusters formed through Agglomerative Clustering. The y axis represents the distance or dissimilarity between clusters. The large vertical jump at around a distance of 500 suggests that merging clusters beyond this point leads to a significant loss of information. Therefore, it indicates that splitting the data into two main clusters (green and orange) might be the most appropriate choice based on this dendrogram, as more divisions occur at much smaller distances.
# Compute the k-distance plot for DBSCAN
nearest_neighbors = NearestNeighbors(n_neighbors=5) # Use min_samples value
nearest_neighbors.fit(cluster_sampled)
distances, indices = nearest_neighbors.kneighbors(cluster_sampled)
# Sort the distances and plot
sorted_distances = np.sort(distances[:, -1]) # Get the distances to the k-th nearest neighbor
plt.figure(figsize=(8, 5))
plt.plot(sorted_distances)
plt.xlabel('Points Sorted by Distance')
plt.ylabel('Distance to 5th Nearest Neighbor')
plt.title('K-Distance Plot for Optimal eps Selection')
plt.grid(True)
plt.show()
# Use k-distance plot to determine optimal eps and min_samples
dbscan_model = DBSCAN(eps=1, min_samples=10)
cluster_sampled['cluster_dbscan'] = dbscan_model.fit_predict(cluster_sampled)
# Check the number of clusters found and the unique labels
unique_labels = cluster_sampled['cluster_dbscan'].unique()
print(f"Number of clusters found: {len(unique_labels) - (1 if -1 in unique_labels else 0)}")
print(f"Unique labels: {unique_labels}")
Number of clusters found: 73 Unique labels: [ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 -1 24 25 30 26 27 33 28 29 31 32 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 68 71 61 67 62 63 64 65 66 69 70 72]
# Scatter plot of two relevant features with cluster labels color-coded
plt.figure(figsize=(10, 6))
sns.scatterplot(x='runtimeMinutes', y='averageRating', hue='cluster_dbscan', data=cluster_sampled, palette='tab20')
plt.title('DBSCAN Clusters Visualization')
plt.show()
cluster_sampled.head()
| genre1 | director1 | runtimeMinutes | titleType_encoded | genre2 | cluster_agglomerative | cluster_dbscan | |
|---|---|---|---|---|---|---|---|
| 99618 | -1.108926 | -0.962123 | 0.064125 | -0.790831 | 0.084356 | 1 | 0 |
| 28895 | -1.270066 | -0.353591 | -0.124710 | 0.199777 | -0.897178 | 2 | 1 |
| 187678 | -0.303226 | 1.262570 | -0.372556 | -0.295527 | 0.902302 | 6 | 2 |
| 17756 | -0.786646 | -0.892993 | 0.093631 | -0.790831 | -0.897178 | 1 | 0 |
| 140193 | -1.108926 | 0.633837 | 0.105433 | -0.790831 | -0.406411 | 1 | 0 |
# Calculate mean values for each cluster in Agglomerative Clustering
agg_cluster_summary = cluster_sampled.groupby('cluster_agglomerative').mean()
# Calculate mean values for each cluster in DBSCAN Clustering
dbscan_cluster_summary = cluster_sampled.groupby('cluster_dbscan').mean()
import seaborn as sns
import matplotlib.pyplot as plt
# Scatter plot of two key features with clusters color-coded
plt.figure(figsize=(8, 6))
sns.scatterplot(x='runtimeMinutes', y='averageRating', hue='cluster_agglomerative', data=cluster_sampled, palette='Set2')
plt.title('Scatter Plot of Clusters (Agglomerative)')
plt.show()
# Analyze distribution of average ratings across clusters
sns.boxplot(x='cluster_agglomerative', y='averageRating', data=cluster_sampled)
plt.title('Distribution of Average Ratings across Agglomerative Clusters')
plt.show()
Cluster 1: Lowest median rating and compact distribution, indicating a group of consistently low-rated movies.
Clusters 4, 5, 7, and 8: Higher median ratings, representing better-received movies.
Cluster 0: High variability, suggesting a mix of both well-rated and poorly-rated movies.
Clusters 2 and 3: Compact distribution around the median, indicating more consistent ratings.
I couldn't derive many meaningful insights from the cluster analysis due to the complexity and messiness of the data. With more time for deeper exploration and refinement, it's possible that more valuable insights could be uncovered.
Upload the AirPassenger data. The dataset has two columns: Month representing the time in a specific format (e.g., "1949-01" for January 1949) and #Passengers representing the number of passengers for each corresponding month.
Extract month and year from the 'Month' column. Perform summary statistics on the '#Passengers' column. Visualize the overall distribution of passenger counts using appropriate plots. Create time series plots to visually inspect trends, seasonality, and patterns. Decompose the time series into trend, seasonality, and residual components, and visualize each component separately.
Apply ARIMA, Exponential Smoothing, and Prophet models to the data. Evaluate the accuracy of all three forecasting methods using appropriate metrics such as RMSE, MAE, and MAPE. Create autocorrelation and partial autocorrelation plots to identify lag relationships in the time series data. Implement and visualize a technique like moving averages or differencing to remove trends and highlight patterns in the data.
Implement an anomaly detection method, such as a rolling average with anomaly thresholds. Visualize and analyze the identified anomalies in the context of the time series. Discuss the implications of these anomalies and how they relate to the overall patterns observed in the data. Compare the results of the anomaly detection with the findings from the forecasting models, and explain any discrepancies or correlations between the two analyses.
# Standards
import pandas as pd
pd.set_option('display.max_columns', None)
import numpy as np
import warnings
warnings.filterwarnings('ignore')
import matplotlib.pyplot as plt
import seaborn as sns
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.tsa.holtwinters import ExponentialSmoothing
from statsmodels.tsa.seasonal import seasonal_decompose
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
from prophet import Prophet
from sklearn.metrics import mean_squared_error, mean_absolute_error
# Data location
data_loc = "/Users/gabrielvictorgomesferreira/Library/Mobile Documents/com~apple~CloudDocs/Work/ISU Classes/MIS 546 - Advanced Business Analytics/Data/"
file_name = "AirPassengers.csv"
# Import dataset
air_passengers = pd.read_csv(data_loc + file_name)
print("Data dimentions: ", air_passengers.shape)
air_passengers.head()
Data dimentions: (144, 2)
| Month | #Passengers | |
|---|---|---|
| 0 | 1949-01 | 112 |
| 1 | 1949-02 | 118 |
| 2 | 1949-03 | 132 |
| 3 | 1949-04 | 129 |
| 4 | 1949-05 | 121 |
air_passengers.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 144 entries, 0 to 143 Data columns (total 2 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Month 144 non-null object 1 #Passengers 144 non-null int64 dtypes: int64(1), object(1) memory usage: 2.4+ KB
# Extract Year and Month from Month column
air_passengers['date'] = pd.to_datetime(air_passengers['Month'])
air_passengers['year'] = air_passengers['date'].dt.year
air_passengers['month'] = air_passengers['date'].dt.month
air_passengers = air_passengers.drop(columns='Month')
air_passengers.head()
| #Passengers | date | year | month | |
|---|---|---|---|---|
| 0 | 112 | 1949-01-01 | 1949 | 1 |
| 1 | 118 | 1949-02-01 | 1949 | 2 |
| 2 | 132 | 1949-03-01 | 1949 | 3 |
| 3 | 129 | 1949-04-01 | 1949 | 4 |
| 4 | 121 | 1949-05-01 | 1949 | 5 |
# Summary statistics of Passengers column
print("Summary Statistics for '#Passengers':\n", air_passengers['#Passengers'].describe())
Summary Statistics for '#Passengers': count 144.000000 mean 280.298611 std 119.966317 min 104.000000 25% 180.000000 50% 265.500000 75% 360.500000 max 622.000000 Name: #Passengers, dtype: float64
# Visualize the distribution of passenger counts
plt.figure(figsize=(10, 6))
sns.histplot(air_passengers['#Passengers'], kde=True, color='skyblue')
plt.title('Distribution of Air Passenger Counts')
plt.xlabel('Number of Passengers')
plt.ylabel('Frequency')
plt.grid(True)
plt.show()
Distribution Shape: The distribution is right-skewed, indicating that most months have moderate passenger counts (150-300), with fewer high-passenger months (500+).
Peak Frequency: The highest frequency is observed around 200-250 passengers, making this range the most common.
Insights: The skewness suggests occasional peak travel seasons or events. The right tail may include potential anomalies or outliers that could be explored further.
# Plot the time series to visualize overall trends
plt.figure(figsize=(12, 6))
plt.plot(air_passengers['month'], air_passengers['#Passengers'], color='blue', marker='o')
plt.title('Monthly Air Passengers Over Time')
plt.xlabel('Month')
plt.ylabel('Number of Passengers')
plt.grid(True)
plt.show()
The above line chart shows a clear seasonal pattern, with peaks occurring around mid-year (July-August) and troughs around the start and end of the year (January-December).
Each year seems to have a similar pattern in terms of seasonal peaks and troughs, suggesting strong seasonality.
airpassenger_data = air_passengers.copy()
# Set Month as the index to create a time series object
airpassenger_data.set_index('month', inplace=True)
# Decompose the time series into trend, seasonal, and residual components
decomposition = seasonal_decompose(airpassenger_data['#Passengers'], model='multiplicative', period=12)
# Plot the decomposed components
fig, (ax1, ax2, ax3) = plt.subplots(3, 1, figsize=(12, 12))
ax1.plot(decomposition.trend, label='Trend', color='orange')
ax1.set_title('Trend Component')
ax1.grid(True)
ax2.plot(decomposition.seasonal, label='Seasonality', color='green')
ax2.set_title('Seasonal Component')
ax2.grid(True)
ax3.plot(decomposition.resid, label='Residual', color='red')
ax3.set_title('Residual Component')
ax3.grid(True)
# Show all subplots
plt.tight_layout()
plt.show()
Trend Component: The trend component is relatively flat.
Seasonal Component: The seasonal component displays the monthly variations, with July and August showing peaks, while January and December show troughs. Reinforcing our previous seasionality insight.
Residual Component: The residual component appears relatively flat, indicating that most variations in the original time series are captured by the trend and seasonal components. Some minor fluctuations suggest occasional deviations from the expected pattern.
The monthly and yearly time series show clear trends and seasonality.
The decomposition shows that the overall increase in passenger counts rusulting mostly in a positive trend, while seasonal fluctuations stays relatively stable over the years.
# Plot the time series to visualize overall trends
plt.figure(figsize=(12, 6))
plt.plot(air_passengers['year'], air_passengers['#Passengers'], color='blue', marker='o')
plt.title('Yearly Air Passengers Over Time')
plt.xlabel('Year')
plt.ylabel('Number of Passengers')
plt.grid(True)
plt.show()
The yearly plot shows a consistent increase in the number of passengers over the years.
There are notable increases in the variance of passenger counts as well, suggesting increased fluctuations in demand over time. Each year shows a clear peak, typically occurring around mid-year.
# Set Month as the index to create a time series object
airpassenger_data.set_index('year', inplace=True)
# Decompose the time series into trend, seasonal, and residual components
decomposition = seasonal_decompose(airpassenger_data['#Passengers'], model='multiplicative', period=12)
# Plot the decomposed components
fig, (ax1, ax2, ax3) = plt.subplots(3, 1, figsize=(12, 12))
ax1.plot(decomposition.trend, label='Trend', color='orange')
ax1.set_title('Trend Component')
ax1.grid(True)
ax2.plot(decomposition.seasonal, label='Seasonality', color='green')
ax2.set_title('Seasonal Component')
ax2.grid(True)
ax3.plot(decomposition.resid, label='Residual', color='red')
ax3.set_title('Residual Component')
ax3.grid(True)
# Show all subplots
plt.tight_layout()
plt.show()
Trend component: Shows a clear stepwise increase, indicating periods of steady growth in passenger counts. This stepwise pattern may indicate periods of economic growth or increased airline capacity, leading to sudden jumps in passenger numbers.
Seasonal component: Shows regular yearly fluctuations, with peaks and troughs occurring at the same points each year. This once again reflects the seasonality observed in the monthly analysis, where mid-year months show the highest demand.
Residual component: Shows some deviation, particularly in the earlier years, indicating that not all variations are captured by the trend and seasonality.
# Split the data into train and test sets
train = air_passengers.iloc[:-12] # All data except the last 12 months
test = air_passengers.iloc[-12:] # Last 12 months
# Fit ARIMA model on train set
arima_model = ARIMA(train['#Passengers'], order=(5, 1, 0))
arima_fit = arima_model.fit()
arima_forecast = arima_fit.forecast(steps=12) # Forecast for 12 months
# Fit Exponential Smoothing model on train set
exp_smooth = ExponentialSmoothing(train['#Passengers'], seasonal='multiplicative', seasonal_periods=12)
exp_smooth_fit = exp_smooth.fit()
exp_smooth_forecast = exp_smooth_fit.forecast(steps=12)
# Fit Prophet model on train set
prophet_train = train.reset_index().rename(columns={'date': 'ds', '#Passengers': 'y'})
prophet_model = Prophet()
prophet_model.fit(prophet_train)
future = prophet_model.make_future_dataframe(periods=12, freq='M')
prophet_forecast = prophet_model.predict(future)
prophet_forecast_values = prophet_forecast['yhat'].iloc[-12:].values # Last 12 values
23:38:02 - cmdstanpy - INFO - Chain [1] start processing 23:38:02 - cmdstanpy - INFO - Chain [1] done processing
# Create a new DataFrame for comparison
comparison_df = test[['date','#Passengers']].copy()
comparison_df['ARIMA_Prediction'] = arima_forecast.values
comparison_df['Exponential_Smoothing_Prediction'] = exp_smooth_forecast.values
comparison_df['Prophet_Prediction'] = prophet_forecast_values
# Plot actual vs. predicted values for the last 12 months
plt.figure(figsize=(12, 6))
plt.plot(comparison_df['date'], comparison_df['#Passengers'], label='Actual Data', color='blue', marker='o')
plt.plot(comparison_df['date'], comparison_df['ARIMA_Prediction'], label='ARIMA Prediction', color='orange', marker='o')
plt.plot(comparison_df['date'], comparison_df['Exponential_Smoothing_Prediction'], label='Exponential Smoothing Prediction', color='green', marker='o')
plt.plot(comparison_df['date'], comparison_df['Prophet_Prediction'], label='Prophet Prediction', color='red', marker='o')
plt.xlabel('Date')
plt.ylabel('Number of Passengers')
plt.title('Actual vs. Predicted Values for the Last 12 Months')
plt.legend()
plt.grid(True)
plt.show()
# Calculate evaluation metrics
def evaluate_forecast(actual, forecast):
rmse = np.sqrt(mean_squared_error(actual, forecast))
mae = mean_absolute_error(actual, forecast)
mape = np.mean(np.abs((actual - forecast) / actual)) * 100
return rmse, mae, mape
# Evaluate each model
arima_rmse, arima_mae, arima_mape = evaluate_forecast(test['#Passengers'], arima_forecast)
exp_rmse, exp_mae, exp_mape = evaluate_forecast(test['#Passengers'], exp_smooth_forecast)
prophet_rmse, prophet_mae, prophet_mape = evaluate_forecast(test['#Passengers'], prophet_forecast_values)
# Print the evaluation results
print(f"ARIMA: RMSE = {arima_rmse:.2f}, MAE = {arima_mae:.2f}, MAPE = {arima_mape:.2f}%")
print(f"Exponential Smoothing: RMSE = {exp_rmse:.2f}, MAE = {exp_mae:.2f}, MAPE = {exp_mape:.2f}%")
print(f"Prophet: RMSE = {prophet_rmse:.2f}, MAE = {prophet_mae:.2f}, MAPE = {prophet_mape:.2f}%")
ARIMA: RMSE = 86.81, MAE = 67.39, MAPE = 13.41% Exponential Smoothing: RMSE = 29.74, MAE = 26.71, MAPE = 5.49% Prophet: RMSE = 41.38, MAE = 34.75, MAPE = 6.98%
Exponential Smoothing outperforms both ARIMA and Prophet models based on all three evaluation metrics (RMSE, MAE, and MAPE). It is best suited for this dataset, likely due to its ability to handle seasonality and trends effectively.
Prophet performs moderately well but has higher error values compared to Exponential Smoothing.
ARIMA has the highest error values among the three models, indicating it may not be well-suited for capturing the seasonal patterns in the data. Overall, Exponential Smoothing appears to be the most reliable model for forecasting air passenger counts in this dataset.
# ACF and PACF plots
plt.figure(figsize=(12, 6))
plt.subplot(121)
plot_acf(air_passengers['#Passengers'], lags=30, ax=plt.gca())
plt.title('Autocorrelation Function')
plt.subplot(122)
plot_pacf(air_passengers['#Passengers'], lags=30, ax=plt.gca())
plt.title('Partial Autocorrelation Function')
plt.tight_layout()
plt.show()
# Calculate the moving average
air_passengers['Moving_Avg'] = air_passengers['#Passengers'].rolling(window=12).mean()
# Plot original vs. moving average
plt.figure(figsize=(12, 6))
plt.plot(air_passengers['#Passengers'], label='Original')
plt.plot(air_passengers['Moving_Avg'], label='Moving Average', color='red')
plt.title('Original vs. Moving Average (12 months)')
plt.legend()
plt.grid(True)
plt.show()
# Calculate the rolling mean and standard deviation
air_passengers['Rolling_Mean'] = air_passengers['#Passengers'].rolling(window=12).mean()
air_passengers['Rolling_Std'] = air_passengers['#Passengers'].rolling(window=12).std()
# Define upper and lower thresholds based on standard deviation
air_passengers['Upper_Threshold'] = air_passengers['Rolling_Mean'] + 2 * air_passengers['Rolling_Std']
air_passengers['Lower_Threshold'] = air_passengers['Rolling_Mean'] - 2 * air_passengers['Rolling_Std']
# Identify anomalies where the actual passenger count is outside the threshold bounds
air_passengers['Anomaly'] = ((air_passengers['#Passengers'] > air_passengers['Upper_Threshold']) |
(air_passengers['#Passengers'] < air_passengers['Lower_Threshold']))
# Filter the anomalies
anomalies = air_passengers[air_passengers['Anomaly']]
# Plot the results
plt.figure(figsize=(12, 6))
plt.plot(air_passengers['#Passengers'], label='Passenger Count', color='blue')
plt.plot(air_passengers['Rolling_Mean'], label='Rolling Mean', color='orange', linestyle='--')
plt.fill_between(air_passengers.index, air_passengers['Upper_Threshold'], air_passengers['Lower_Threshold'],
color='gray', alpha=0.2, label='Thresholds')
plt.scatter(anomalies.index, anomalies['#Passengers'], color='red', label='Anomalies', s=100)
plt.title('Anomaly Detection using Rolling Average')
plt.xlabel('Date')
plt.ylabel('Number of Passengers')
plt.legend()
plt.grid(True)
plt.show()
print("Anomalies Detected:")
print(anomalies[['#Passengers', 'Rolling_Mean', 'Upper_Threshold', 'Lower_Threshold']])
Anomalies Detected:
#Passengers Rolling_Mean Upper_Threshold Lower_Threshold
date
1950-07-01 170 132.166667 168.436698 95.896636
1954-07-01 302 229.583333 301.802486 157.364180
1955-07-01 364 264.416667 352.116395 176.716938
1956-07-01 413 312.000000 408.419538 215.580462
1957-07-01 465 350.416667 463.651606 237.181727
1960-07-01 622 459.416667 613.805687 305.027646
Anomalies Detected: All anomalies occur in July during the years 1950, 1954, 1955, 1956, 1957, and 1960, indicating unusually high mid-year passenger counts.
Relation to Patterns: These anomalies correspond to peak travel periods (summer months) and are consistent with seasonal spikes observed in the overall time series.
Model Performance Comparison: Exponential Smoothing captured these seasonal peaks best, resulting in low RMSE and MAPE values. ARIMA struggled to capture these anomalies, leading to higher errors. Prophet performed moderately, capturing some peaks but not as accurately as Exponential Smoothing.
Conclusion: The anomalies align with expected seasonal trends, and Exponential Smoothing was best suited to model these patterns. Future models should consider incorporating external seasonal factors for improved accuracy.
For this homework, I primarily relied on ChatGPT as my AI tool.
It often helps me a lot to understand and break down complex concepts into smaller, manageable parts, which I could then piece together to improve my understanding. The explanations are detailed and easy to follow, and this approach significantly enhanced my learning experience lately.
Additionally, ChatGPT helps a lot on my writing. It helps me to articulate my ideas more clearly and concisely. There was a particular instance where I spent about 25 minutes trying to formulate an explanation. After using ChatGPT to refine it, the core meaning of my original response remained intact, but the clarity and brevity were improved a lot. And every time I use it, I ask tips to improve my writting as well.
Another area where ChatGPT helped was with data visualization. Previously, I would spend considerable time searching for the right code on platforms like StackOverflow to generate plots. With ChatGPT, I was able to speed up this process, quickly getting the necessary plotting code and visualizing the data more efficiently.